A User Preference Tree based Personalized Route Recommendation System for Constraint Tourism and Travel

Personalized recommendation systems recommend target destinations based on user-generated data from social media and geo-tagged photos, which are currently among the most pertinent sources available. This paper proposes a tourism destination recommendation system that uses heterogeneous data sources, interpreting both texts posted on social media and images of tourist places visited and shared by tourists. For this purpose, we propose an enhanced user profile that uses a User-Location Vector with LDA and Jaccard coefficients. Moreover, a new Tourist Destination tree is constructed using posts extracted from TripAdvisor, where each node of the destination tree consists of tourist destination data. Finally, we build a personalized recommendation system based on user preferences, the A* algorithm and a heuristic shortest path algorithm with cost optimization based on a backtracking-based Travelling Salesman Problem solution, the tourist destination tree and tree-based hybrid recommendations. Here, the 0/1 knapsack algorithm is used to recommend the best tourist destination travel route plans according to the travel time and cost constraints of the tourists. The experimental results show that the proposed user-centric personalized destination and travel route recommendation system provides better recommendations of tourist places than the existing systems by handling multiple heterogeneous data sources efficiently and recommending optimal tour plans with minimum cost and time.


INTRODUCTION
Recently, recommendation systems have been dealing with the issue of information overload faced by users online. Initially, recommendation systems succeeded in providing personalized recommendations in e-commerce applications. Later on, it was applied to all the
1. A new method for user profile maintenance using location constraints, applying LDA and the Jaccard coefficient to provide effective recommendations.
2. A tree-based approach built on TripAdvisor data to find new tourist destinations more efficiently.
3. A new personalized recommendation system which uses user profile information and the browsing pattern of the users.
4. A Knapsack problem solution approach for optimizing the distance and cost of visiting places.
5. A heuristic search approach combining the A* algorithm and the Travelling Salesman Problem to find the shortest path from the source to the destination, visiting all the required places with optimal cost.
The rest of this paper is organized as follows: Section 2 describes the user profiles that are created and gives an insight into various tourism-related recommendation systems. Section 3 explains the proposed user preference tree based personalized route recommendation system. Section 4 discusses the data set used and the evaluation parameters applied for evaluating the efficiency of the proposed system. Finally, Section 5 concludes the work with suggestions for proceeding further in this direction.

LITERATURE SURVEY
One such hybrid method of recommendation is based on the principles of fuzzy set theory. In such models, a fuzzy user preference tree is used to represent a user's preferences, as stated by Esparza et al. (2013) and Wu et al. (2015). The application of fuzzy rules in these models helps to handle the decision-making process with incomplete data. Neri et al. (2012) proposed a personalized travel planning system which considers various types of user needs and simultaneously provides users with a travel schedule planning service based on those needs, using sentiment analysis on data obtained from social media. The use of sentiment analysis provides a suitable tourist location with respect to the characteristics of the user. Shen et al. (2016) discussed their recommendation system, which provides a user-adapted interface and dynamically adjusts the results based on user interests. Sun et al. (2013) proposed a new personalized travel recommendation system which recommends places of tourist attraction based on user interaction and collective intelligence in a unified framework.
This model is useful for obtaining destinations based on current interest. Chiang and Huang (2015) developed a tourist recommendation system which is capable of identifying and matching visual characteristics of tourist location images pertaining to the user's places of interest with images of various tourist attractions. Chen et al. (2014) proposed a novel approach to recommend a travel route that helps to visit the popular tourist attractions in a destination without taking the user preferences into account. However, recommendation systems must consider the user interests as well. explicitly from the user and hence these systems may not serve as expert-opinion-based travel recommendation systems. In addition, Hsiu-Sen Chiang et al. (2015) allowed the users of their system to modify the results for further personalization. This unsupervised, human-entered data may include noise, so the recommendations may turn out to be unreliable, making the recommender system error-prone.
Based on these limitations, it is necessary to propose an effective and personalized recommendation system for optimizing the time and cost of tours. Therefore, a new personalized tourism recommendation system is proposed and implemented in this work which overcomes the limitations of the existing tour recommendation systems. In this model, the User Profile is created by extracting implicit information available in multiple heterogeneous data sources, namely Flickr and Facebook. The geo-tagged photos posted by tourists on Flickr, as well as the text and images posted by tourists on Facebook, are used in this work to mine the user interest. Each User Profile is stored as a user node in the proposed user preference tree structure, which stores the mined explicit and implicit information of users hierarchically. The User Preference tree and the Tourist Destination Tree are constructed in this model, and they are compared and analyzed to give a personalized recommendation to the tourist. The analysis applies rules, temporal and cost constraints, LDA-based topic modeling, the Jaccard similarity measure, the 0/1 Knapsack approach, the A* algorithm and a Travelling Salesman Problem based solution to perform optimal tourist place recommendations.

METHODOLOGIES
The flow of the system design is given in Figure 2. The proposed user centric personalized recommendation system collects data from multiple heterogeneous data sources such as posts and photos of users interested in traveling from the social media and interprets the posts and photos to find the topic of interest of users and recommends personalized tourist destinations.
The demographic information, posts and photos of the travelers are collected from Facebook.
The photos of tourist destinations from Flickr are collected and clustered based on geo-tags, and representative images for each cluster are computed to build a database of images. These tags are then passed to a Latent Dirichlet Allocation (LDA) model (Blei et al., 2003) to obtain the probabilities associated with each topic, which is the first probability value considered in this work. Places travelled by users are collected from Facebook, and this information is used in two ways. First, the collected information is used to find similar users by applying Jaccard similarity. Second, the descriptions of the travelled places are passed to an LDA model to obtain the probabilities associated with each topic, which is the second probability value considered. The posts written by the users are also passed to an LDA model to obtain the associated probabilities, which is the third probability value considered. The three probabilities are combined by assigning weights, and the resulting combined probabilities are stored together with the user-location vector in a newly developed data structure as a user node. Because the user nodes are stored in a graph database, links between users can be easily constructed. Both the user preference nodes and the destination tree nodes are used to generate a recommendation list of tourist destinations, along with other travel information about the recommended destinations. Travel route suggestion is also performed using the Knapsack algorithm, which suggests the best route for the tourist in the available time. The research design is explained in detail in the sections below.
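The weighted combination of the three topic probabilities can be illustrated with a minimal Python sketch. The function name, the dictionary representation of topic distributions, the topic labels used in any call, and the final renormalization step are assumptions made for illustration; the default weights follow the first evaluation scenario (50% posts, 30% images, 20% places visited).

```python
def combine_topic_probabilities(posts, images, places, weights=(0.5, 0.3, 0.2)):
    """Combine three per-topic probability dicts into one weighted distribution.

    posts, images, places: dicts mapping a topic label to a probability
    (hypothetical representation of the three LDA outputs).
    weights: (posts, images, places) weightage; assumed to sum to 1.
    """
    w_posts, w_images, w_places = weights
    topics = set(posts) | set(images) | set(places)
    combined = {
        topic: w_posts * posts.get(topic, 0.0)
        + w_images * images.get(topic, 0.0)
        + w_places * places.get(topic, 0.0)
        for topic in topics
    }
    total = sum(combined.values())
    # Renormalize so the result is again a probability distribution
    # (an assumption; the paper only states that weights are assigned).
    if total > 0:
        combined = {topic: p / total for topic, p in combined.items()}
    return combined
```

Changing the `weights` tuple reproduces the other evaluation scenarios without touching the rest of the pipeline.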

Traveler Information Extraction from Facebook
Initially, traveler data is collected from various Travel Interest Groups available on Facebook, consisting of five hundred posts and fifty timeline photos. The ids of these popular travel bloggers are collected using the Facebook Graph API, which further allows us to collect the demographic information, timeline photos and posts of each user. The traveler's user id, Facebook feed, age, gender, photos and tagged locations are collected, preprocessed, tokenized and further tagged to build basic user profiles.

Users' Profile
The photos are grouped by their geo-tags to find the tourist locations where they were taken, and the locations are identified and labeled with semantic information. The algorithm also creates a user-location matrix and a user-user similarity matrix that can be further used for personalized tourist recommendations. To build the basic user profile, the user data are preprocessed, tokenized and POS tagged.

Preprocessing
User data such as the user's feed, age, gender, photos and tagged locations are extracted from Facebook. Here, the user data is preprocessed by removing redundant terms, unwanted and irrelevant data, stop words and null values.

Tokenization and POS Tagging
The preprocessed user data are tokenized, and POS tagged in order to identify the tagged

Reference Image Cluster Builder from Flickr Images
Images shared on social media are annotated with tags. The input to the Reference Image Cluster Builder is the set of geo-tagged photos of tourist destinations from Flickr. From the retrieved geo-tagged photos, photo tags are collected, and these tourist destination photos are clustered based on geo-tags.

Retrieve geo-tagged photos
Geo-tagged photos of Tourist destinations are retrieved from Flickr using the Flickr API. Photos that have tags associated with Tourist destinations and have their geo-coordinates enabled are retrieved. Once the source URLs are obtained, the actual photos are extracted.

Photo Tags Collection
The set of photo ids is collected from the retrieved geo-tagged photos, and these ids are used to obtain the tags attached to the collected photos. All tags are collected from the states of India and are clustered on a state-wise basis using geo-tags.

Cluster based on Geo-Tags
The geo-tags are clustered based on the latitudes and longitudes retrieved from the photos, using the mean shift clustering algorithm to obtain state-wise clusters of images. Mean shift clustering is used since it is a non-parametric clustering algorithm that does not require the number of clusters to be fixed in advance.
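The geo-tag clustering step can be sketched in plain Python as follows. This is a simplified illustration, not the implementation used in this work: it assumes a flat (uniform) kernel, treats latitude and longitude as planar coordinates, and expresses the bandwidth in degrees. The function name and parameters are hypothetical.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift over (lat, lon) pairs.

    Returns (labels, centres): one cluster label per input point and the
    list of cluster centres found.
    """
    modes = [tuple(p) for p in points]
    for _ in range(max_iter):
        largest_move = 0.0
        shifted = []
        for mode in modes:
            # Points inside the window around the current estimate.
            near = [p for p in points if math.dist(mode, p) <= bandwidth]
            if not near:  # numerical guard; keep the estimate unchanged
                shifted.append(mode)
                continue
            centre = tuple(sum(c) / len(near) for c in zip(*near))
            largest_move = max(largest_move, math.dist(mode, centre))
            shifted.append(centre)
        modes = shifted
        if largest_move < tol:
            break
    # Merge estimates that converged to (almost) the same location.
    centres, labels = [], []
    for mode in modes:
        for idx, centre in enumerate(centres):
            if math.dist(mode, centre) <= bandwidth / 2:
                labels.append(idx)
                break
        else:
            centres.append(mode)
            labels.append(len(centres) - 1)
    return labels, centres
```

The number of clusters is not passed in; it emerges from the bandwidth, which is why the approach suits data where the number of groups is not known beforehand.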

Compute representative samples for each cluster
Representative samples are computed with Affinity Propagation (AP), a clustering algorithm based on the concept of "message passing" that works on the differences between data points. The affinity propagation algorithm is used to find an exemplar for each image in the set of images provided. Messages are iteratively exchanged between data points until a good solution with a set of exemplars is reached. Once the algorithm converges, the exemplar for image i is selected by equation (1).
$$\hat{k}_i = \arg\max_{k} \left[ a(i,k) + r(i,k) \right] \qquad (1)$$
where $r(i,k)$ is the responsibility, the message sent to candidate exemplar $k$ from data point $i$, and $a(i,k)$ is the availability, the message sent to data point $i$ from candidate exemplar $k$. Similarly, $s(i,k)$, i.e. the similarity between two images i and k, is computed using equation (2).
$$s(i,k) = \sum_{j=1}^{r} \left( \left| E_{i,j} - E_{k,j} \right| + \left| \sigma_{i,j} - \sigma_{k,j} \right| + \left| \gamma_{i,j} - \gamma_{k,j} \right| \right) \qquad (2)$$
where i and k are the two images compared,
$j$ indicates the channel index and $r$ represents the total number of channels,
$E_{i,j}$ and $E_{k,j}$ are the first order moments calculated from the distribution of the image,
$\sigma_{i,j}$ and $\sigma_{k,j}$ are the second order moments calculated from the distribution of the image, and
$\gamma_{i,j}$ and $\gamma_{k,j}$ are the third order moments calculated from the distribution of the image.
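The moment-based comparison of two images described above can be sketched as follows. The sketch assumes each image is given as a list of channels, each channel being a flat list of pixel values, and that the three moments are the mean, the standard deviation and the cube root of the third central moment. The unweighted sum of absolute differences and all function names are assumptions made for illustration.

```python
import math


def channel_moments(channel):
    """Return (mean, standard deviation, skewness) of one image channel."""
    n = len(channel)
    mean = sum(channel) / n
    variance = sum((x - mean) ** 2 for x in channel) / n
    third = sum((x - mean) ** 3 for x in channel) / n
    # Signed cube root keeps the sign of the third central moment.
    skewness = math.copysign(abs(third) ** (1.0 / 3.0), third)
    return mean, math.sqrt(variance), skewness


def moment_difference(image_a, image_b):
    """Sum, over all channels, of the absolute differences of the moments."""
    total = 0.0
    for channel_a, channel_b in zip(image_a, image_b):
        moments_a = channel_moments(channel_a)
        moments_b = channel_moments(channel_b)
        total += sum(abs(a - b) for a, b in zip(moments_a, moments_b))
    return total
```

A value of zero means the two images have identical moments in every channel; larger values indicate more dissimilar images.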

Enhancing User Profile
The inputs are the user's feed, demographic information such as age and gender, timeline photos and tagged locations. From the photos collected, representative samples are obtained from the reference image cluster builder module, and the output is the enhanced user profile stored as a user node. The steps involved in enhancing user profiles are: mine places travelled by users, compute the user-location vector, calculate user-user similarity, compute representative samples of user images, perform topic modeling of user posts, places descriptions and high-ranked places' tags, and build the user node.

Mine places travelled by users
Popular travel destinations around the world visited by travelers are mined from the posts of the collected users. The list of places visited by each traveler is collected and stored.

User Location Vector Computation
The

Calculate user-user similarity
Similarity between travelers is computed based on the Jaccard similarity coefficient. The similarity coefficient is calculated between traveler pairs in order to recommend locations to users with the same interests. Using the user-location vector, similar users are extracted and used for personalized recommendation based on user interest. The Jaccard coefficient is computed as
$$J(u_i, u_j) = \frac{|P_i \cap P_j|}{|P_i \cup P_j|}$$
where $u_i$ and $u_j$ are users and $P_i$, $P_j$ are the sets of places visited by $u_i$ and $u_j$ respectively, which are extracted from the user-location vector. In this work, based on the Jaccard similarity computation, the top three similar users are chosen for each user.
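A minimal Python sketch of this step is given below. It assumes the user-location information is available as a mapping from user id to the set of visited places; the function names and the tie-breaking by user id are illustrative assumptions.

```python
def jaccard(places_a, places_b):
    """Jaccard coefficient between two collections of visited places."""
    a, b = set(places_a), set(places_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_similar_users(visited, user, k=3):
    """Return the k users most similar to `user` as (user_id, score) pairs.

    visited: dict mapping user id -> set of places visited (hypothetical
    representation of the user-location vector).
    """
    scores = [
        (jaccard(visited[user], places), other)
        for other, places in visited.items()
        if other != user
    ]
    # Highest similarity first; ties are broken by user id for stability.
    scores.sort(key=lambda item: (-item[0], item[1]))
    return [(other, score) for score, other in scores[:k]]
```

The default `k=3` mirrors the restriction to the top three similar users.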

Build User Node
With the user as the central theme, a data structure (Kavitha et al., 2016) with users as its nodes best fits this work. Here, each user is represented by the user node built from the preceding modules. The input to the user node is the set of results from topic modeling of posts, places descriptions and high-ranked places' tags, the user id and the user-location vector. After

Storage in Graph Database
The user nodes built in the enhancing user profile module are stored in the graph database Neo4j. Links are drawn between a user and the three users most similar to that user, based on the Jaccard coefficients calculated. Each node is identified using the user id, which is unique for each user.

Tourist Destination Content Extraction from TripAdvisor
In the Tourist Destination content extraction phase, document pre-processing and topic modeling are performed, as explained below.

Document pre-processing
The user reviews and tourist data, scraped and collected manually, are formatted for

Topic Modeling
The topic modeling module builds a topic model using the LDA algorithm. LDA is an unsupervised machine learning algorithm that identifies latent topics in a large volume of documents. The method relies on a "bag of words" representation and treats every document as a vector of word counts. For every word in a document, LDA selects a particular topic from the collection of available topics, and the word is drawn from the word distribution associated with the selected topic. The same activity is repeated for all the words in the document. LDA is thus based on the hypothesis that a person writing a document has specific topics in mind.
Each extracted tourist document is represented as a mixture of various topics of interest of the user. The probability of occurrence of the term $t_i$ can be computed using equation (5).
$$P(t_i \mid d) = \sum_{j=1}^{Z} P(t_i \mid z_i = j)\, P(z_i = j \mid d) \qquad (5)$$
In equation (5), $P(t_i \mid d)$ is the probability of occurrence of the term $t_i$ related to travel and tourism for a given document $d$ extracted from the user profile and the details of tourist destinations, and $z_i$ is the topic to which the term $t_i$ present in the document belongs. Also, $P(t_i \mid z_i = j)$ is the probability of occurrence of the term $t_i$ within topic $j$ related to travel and tourism, and $P(z_i = j \mid d)$ is the probability of picking a term from topic $j$ in the document. The number of topics $Z$ related to travel and tourism has to be defined in advance and allows adjusting the degree of specialization of the topics.
Using equation (5), the posterior probabilities can be estimated with equations (6) and (7).
$$\phi_{t,j} = \frac{C^{TZ}_{t,j} + \beta}{\sum_{t'=1}^{T} C^{TZ}_{t',j} + T\beta} \qquad (6)$$
$$\theta_{d,j} = \frac{C^{DZ}_{d,j} + \alpha}{\sum_{j'=1}^{Z} C^{DZ}_{d,j'} + Z\alpha} \qquad (7)$$
where $\phi$ is the topic-term distribution, $\theta$ is the document-topic distribution related to travel and tourism, $C^{TZ}$ is the count of all topic-term assignments, $C^{DZ}$ counts the document-topic assignments, $T$ is the number of distinct terms, and $\alpha$ and $\beta$ are the hyperparameters of the Dirichlet priors. An LDA model is built using the data set obtained from the preprocessing step.
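How the topic-term and document-topic distributions follow from assignment counts can be illustrated with a short Python sketch. It assumes the counts are already available (for example from a sampling procedure, which is not shown) as nested lists, and uses the standard smoothed estimates with symmetric Dirichlet hyperparameters; the function names and the count layout are assumptions made for illustration.

```python
def estimate_phi(term_topic_counts, beta):
    """phi[t][j]: smoothed probability of term t within topic j.

    term_topic_counts[t][j]: number of times term t is assigned to topic j.
    """
    n_terms = len(term_topic_counts)
    n_topics = len(term_topic_counts[0])
    topic_totals = [
        sum(row[j] for row in term_topic_counts) for j in range(n_topics)
    ]
    return [
        [(row[j] + beta) / (topic_totals[j] + n_terms * beta)
         for j in range(n_topics)]
        for row in term_topic_counts
    ]


def estimate_theta(doc_topic_counts, alpha):
    """theta[d][j]: smoothed probability of topic j in document d."""
    n_topics = len(doc_topic_counts[0])
    return [
        [(count + alpha) / (sum(row) + n_topics * alpha) for count in row]
        for row in doc_topic_counts
    ]


def term_given_document(phi, theta, term, doc):
    """P(term | doc): sum over topics of P(term | topic) * P(topic | doc)."""
    return sum(phi[term][j] * theta[doc][j] for j in range(len(theta[doc])))
```

Because each column of `phi` and each row of `theta` sums to one, the values of `term_given_document` over all terms also sum to one for a given document.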

Tourist Destination Recommendation
The recommendation of Tourist places is built using the created user preference node and tourist destination tree. A tree-based recommendation system is built which helps in the generation of travel route suggestions based on user interest. Moreover, A* algorithm and Travelling Salesman Problem (TSP) based approach have been used in this work for providing suitable location recommendations.
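The Travelling Salesman component can be illustrated with a small backtracking search in Python. This is a generic sketch under stated assumptions, not the algorithm as implemented in this work: distances are given as a symmetric, non-negative matrix, places are identified by index, and partial tours are pruned as soon as their cost reaches the best complete tour found so far. The A* part is not shown.

```python
def shortest_tour(dist, start=0):
    """Backtracking search for the cheapest round trip visiting every place.

    dist: square matrix of non-negative travel costs between places.
    Returns (cost, tour) where tour starts and ends at `start`.
    """
    n = len(dist)
    best = {"cost": float("inf"), "tour": None}

    def extend(path, visited, cost):
        if cost >= best["cost"]:
            return  # prune: this partial tour cannot beat the best one
        if len(path) == n:
            total = cost + dist[path[-1]][start]
            if total < best["cost"]:
                best["cost"], best["tour"] = total, path + [start]
            return
        for nxt in range(n):
            if nxt not in visited:
                visited.add(nxt)
                extend(path + [nxt], visited, cost + dist[path[-1]][nxt])
                visited.remove(nxt)

    extend([start], {start}, 0)
    return best["cost"], best["tour"]
```

The search is exponential in the worst case, which is acceptable only for small sets of places such as a handful of recommended destinations.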

Build user preference node
Each user node is retrieved from the database using its unique user id and parsed to obtain the probabilities that serve as input to a user preference node. Moreover, a user node contains the probabilities of each topic obtained from mining the posts, photos and places visited by the users.

Build destination tree
Tourist Destination trees are a key component in

Tree based destination recommendation
A fuzzy reference approach is used to generate recommendations. The probabilities in the leaf nodes are treated as fuzzy values. Algorithm 2 is applied to generate the recommended list of three tourist destinations for the user.

Tourist Destination Route Plan
Tourist Destinations that are recommended using the Tree based destination recommendation are passed on to the Knapsack algorithm to generate the best tourist destination route plan using the time constraint of the user as weights.

Graph Construction
Graph construction covers three kinds of graphs: the location-location graph, the user-location graph and the user-user graph. In the location-location graph, an edge between two locations represents the strength of the correlation between the two locations. In the user-location graph, users and locations are the two kinds of entities. In the user-user graph, a node is a user and an edge between two nodes represents two relations. The top 5 tourist destinations that fit the user's interests are selected and a weighted graph is constructed. The weights between the nodes are the distances between the places, calculated using the haversine distance. The haversine distance metric gives the great-circle distance between two points on a sphere, given their longitudes and latitudes.

Algorithm: Graph Construction
Input: selected nodes, their GPS coordinates
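The weighted graph over the selected destinations can be sketched in Python as follows. The sketch assumes the selected nodes are given as a mapping from place name to (latitude, longitude) in degrees and stores the graph as a nested dictionary; the function names and the mean Earth radius of 6371 km are assumptions made for illustration.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def build_distance_graph(places):
    """Complete weighted graph: graph[u][v] is the distance between u and v."""
    graph = {name: {} for name in places}
    for u, v in combinations(places, 2):
        distance = haversine_km(places[u], places[v])
        graph[u][v] = distance
        graph[v][u] = distance
    return graph
```

With five selected destinations the graph has ten undirected edges, which keeps any subsequent route search small.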

Knapsack Algorithm
The nodes are then subjected to Knapsack algorithm that generates and suggests travel plans of different time periods by changing the capacity of the Knapsack which is the time in this case.

ALGORITHM
Step 1: Input the user's destination place.
Step 2: Display the attractions in the recommendation system.
Step 3: Collect the posts, shares and comments that previous users have made for the locations in the particular visitor areas.
Step 4: Based on user preference, recommend the locations with higher weightage to the user.
Step 5: Compute the weightage from the existing users' likes, comments and shares.
Step 6: Choose the best locations according to the user's time, cost and weightage.
Step 7: Apply the Knapsack algorithm to find the best locations according to the likes, posts and comments.
Step 8: Evaluate how many interesting areas can be covered within the given length of time.
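The selection step can be illustrated with a standard 0/1 knapsack dynamic program in Python, where the capacity is the user's time budget and the value of a place is its weightage. This is a minimal sketch under stated assumptions: visit durations are whole hours, the cost constraint is left out, and the function name and tuple layout are hypothetical.

```python
def knapsack_plan(places, time_budget):
    """Choose the places with maximum total weightage within the time budget.

    places: list of (name, visit_hours, weightage); visit_hours must be a
    non-negative integer. time_budget: available time in whole hours.
    Returns (best_weightage, chosen_names).
    """
    n = len(places)
    # best[i][t]: best weightage using the first i places within t hours.
    best = [[0.0] * (time_budget + 1) for _ in range(n + 1)]
    for i, (_, hours, weightage) in enumerate(places, start=1):
        for t in range(time_budget + 1):
            best[i][t] = best[i - 1][t]  # skip place i
            if hours <= t:               # or take it, if it fits
                best[i][t] = max(best[i][t], best[i - 1][t - hours] + weightage)
    # Trace back through the table to recover which places were taken.
    chosen, t = [], time_budget
    for i in range(n, 0, -1):
        if best[i][t] != best[i - 1][t]:
            name, hours, _ = places[i - 1]
            chosen.append(name)
            t -= hours
    return best[n][time_budget], chosen[::-1]
```

Calling the function with different values of `time_budget` yields travel plans for different time periods.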

Travel route generation
The recommendation algorithm generates a list of recommended tourist destinations. The top five places to visit in each state are listed based on the topics of interest of the user, and popular destination travel routes in a particular state are also suggested using the Knapsack algorithm.
The Knapsack problem, which is also called the rucksack problem, is a combinatorial optimization problem. The time period considered ranges from 24 hours to 5 days. Based on the time availability of the user, the best travel routes are recommended.

RESULTS AND DISCUSSIONS
This section discusses the dataset and the evaluation parameters used for evaluating the performance of the proposed recommendation system, together with the results obtained.

Dataset
The dataset required for this recommendation system is obtained from multiple heterogeneous sources. Posts and photos posted by users on their Facebook pages are obtained using the Facebook Graph API. Data was collected for about thirty users, with five hundred posts and fifty photos per person. For the reference cluster builder, fifty photos for each of eighteen Indian states were collected using the Flickr API. Data about tourist destinations were collected manually and crawled from the TripAdvisor website. We have crawled the tourist information for India, since tourism in India is a booming industry. Tourism is the second largest foreign exchange earner in India. Tourism provides employment to a large number of people, both skilled and unskilled. Tourism contributes 6.23% to the national GDP and 8.78% of the total employment, along with service tax from transport, travel agencies, hostels and airlines. The tourist destination data is compared with the topics of interest generated for each user using the LDA model.

Accuracy
Accuracy is the proportion of true results, both true positives and true negatives, among the total number of cases examined. The formula given in equation (8) is used for evaluating the system accuracy.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (8)$$
where $TP$, $TN$, $FP$ and $FN$ denote the numbers of true positives, true negatives, false positives and false negatives. In the recommendation system, {expected} gives the set of destinations selected by the user, and {obtained} gives the set of destinations recommended by the algorithm. Thus, the accuracy of the recommendation system is calculated using equation (9).
Five scenarios are used while calculating accuracy to evaluate the system.
• In the first scenario, probabilities obtained from user posts are given 50% weightage, probabilities obtained from user images are given 30% and probabilities obtained from places visited by users are given 20% weightage.
• In the second scenario, probabilities obtained from user posts are given 40% weightage, probabilities obtained from user images are given 30% and probabilities obtained from places visited by users are given 30% weightage.
• In the third scenario, probabilities obtained from user posts are given 20% weightage, probabilities obtained from user images are given 50% and probabilities obtained from places visited by users are given 30% weightage.
• In the fourth scenario, probabilities obtained from user posts are given 30% weightage, probabilities obtained from user images are given 60% and probabilities obtained from places visited by users are given 10% weightage.
• In the fifth scenario, probabilities obtained from user posts are given 20% weightage, probabilities obtained from user images are given 30% and probabilities obtained from places visited by users are given 50% weightage.
The results obtained are illustrated in Figure 4.

Figure 4 Accuracy Analysis
It is evident from Figure 4 that Scenario 1 and Scenario 4 provide the best accuracy. In Scenarios 1 and 4, the user posts and user images are given more weightage, which gives higher accuracy than the other scenarios. The accuracy achieved is 60% and above rather than 100% because the expected topics were obtained from a human classifier instead of directly from the user.

Recall and Precision
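The recall value is calculated using the formula given in equation (10).
$$\text{Recall} = \frac{TP}{TP + FN} \qquad (10)$$
For the recommendation system, the recall is given by equation (11).
$$\text{Recall} = \frac{|\{\text{expected}\} \cap \{\text{obtained}\}|}{|\{\text{expected}\}|} \qquad (11)$$
Precision is the fraction of recommended items that are preferred by the user, as given in equation (12).
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (12)$$
For the recommendation system, the precision is given by equation (13).
$$\text{Precision} = \frac{|\{\text{expected}\} \cap \{\text{obtained}\}|}{|\{\text{obtained}\}|} \qquad (13)$$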

Specificity
The specificity is calculated from the numbers of true negatives and false positives: the number of true negatives is divided by the sum of the true negatives and false positives.
$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (14)$$

F-Score
The F-Score is interpreted as a weighted (harmonic) average of precision and recall, where the value 1 is best and 0 is worst.
$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (15)$$
The values are computed based on equation (15) and are plotted in Figure 10. On average, the system gives an F-score of 0.62. The F-Score gives a measure of the accuracy of the recommendation system, and the score indicates that recommending using text, places visited and tourist images gives more accurate recommendations. The F-score did not reach its maximum because the expected result was obtained using a human classifier instead of directly from the user.
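The evaluation measures of this section can be computed from the expected and obtained destination sets as in the following Python sketch. It assumes that true negatives are counted against a known set of candidate destinations; the function name, the `candidates` parameter and the returned dictionary keys are assumptions made for illustration.

```python
def recommendation_metrics(expected, obtained, candidates):
    """Accuracy, precision, recall, specificity and F-score for one user.

    expected: destinations selected by the user.
    obtained: destinations recommended by the algorithm.
    candidates: all destinations that could have been recommended
    (assumed to be known, so that true negatives can be counted).
    """
    expected, obtained, candidates = set(expected), set(obtained), set(candidates)
    tp = len(expected & obtained)               # recommended and wanted
    fp = len(obtained - expected)               # recommended but not wanted
    fn = len(expected - obtained)               # wanted but not recommended
    tn = len(candidates - expected - obtained)  # neither

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f_score": ratio(2 * precision * recall, precision + recall),
    }
```

Averaging the per-user values over all users gives system-level figures comparable to those reported in this section.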

Hit rate
Hit rate can be defined as the intersection of the number of points of interest (POI)
Figure 11 shows the hit rate value at 24, 36 and 48 hours for the different user groups.

Comparative Analysis
The comparative analysis is done by considering the existing tourist recommendation systems and the proposed tourist recommendation system. Here, we have considered the time and cost for performing comparative analysis.
First, figure 12 shows the cost analysis between the proposed tourist recommendation system

Testing of Hypotheses
The following NULL hypotheses have been formulated and tested in this research work.

Hypothesis 1:
There is a significant influence between distance of places considered in the tour and the average number of people visiting the cities in a tour package.

Hypothesis 2:
There is no relationship between the cost of a tour package and the total number of people who opted for the tour package.
To test: There is a significant influence between distance and cost metrics and the number of people opting for a given tour package.  Since the p-value is greater than 0.01, the null hypothesis is accepted at 1% level of significance. Therefore, there is a significant influence between the distance and cost metrics and the number of people selecting the tour package for the given places with the prescribed cost.

CONCLUSIONS AND FUTURE WORKS
In this paper, a new tourism destination recommendation system has been proposed for recommending optimal routes and locations using user preference trees, A* and TSP based heuristic search, LDA and Jaccard similarity analysis based location modeling, and 0/1 knapsack problem based time and cost optimization. In this model, heterogeneous data sources that cover both texts posted on social media and images of tourist places visited are used for analysis. A user-location vector is designed in this work to represent the relationships between users and places. The proposed user profile tree forms a graph that stores the probabilities together with the metadata of the user. This graph is used to retrieve the structure, and the 0/1 Knapsack algorithm is used to arrive at the optimal travel route for the user. The evaluation results obtained from this work show that the proposed tour recommendation system provides more accurate recommendations of places and routes than the other existing tour recommendation systems. The proposed system uses both text and tourist destination images, and hence it is effective in both personalization and recommendation of new places. The system works entirely based on the user's interests, and the recommended travel routes are shown to be optimal compared to other recommendation systems that recommend based on only text or only images and a heuristic search approach. As an extension, links can be drawn between users and their friends so that recommendation lists can be shared among friends. The time constraints of places can also be included in the form of intervals as a new factor while recommending travel routes to users.