Research on Tourist Routes Recommendation Based on the User Preference Drifting Over Time

Tourist route recommendation improves both the tourist experience and the efficiency of tourism companies. Session-based methods divide every user's interaction history into the same number of sessions using a fixed time window and treat the user preference as a time sequence. Because tourism data are highly sparse and strongly temporal, some users have few or even no interactions in some sessions. Fixed-window division therefore aggravates the sparsity, and many session-based methods cannot be applied directly to route recommendation. To better fit the characteristics of tourism data and alleviate the sparsity, a tourist route recommendation method based on user preference drift over time is proposed. Firstly, the sparsity, temporal context, tourist age and price characteristics of tourism data are analyzed on a real tourism data set. Secondly, based on the results of this analysis, each tourist's interaction history is dynamically divided into a variable number of sessions, and the tourist's evolving profile is constructed by mining the probabilistic topic distribution of each session using Latent Dirichlet Allocation (LDA) together with time penalty weights. Then, a tourist feature vector is built from the tourist's age and the price and season of his or her trips, and a set of nearest neighbors and candidate routes is selected based on it. Finally, routes are recommended according to the similarity between the probabilistic topic distributions of the active tourist and of the routes. Experimental results show that the proposed method not only adapts effectively to the characteristics of tourism data but also improves recommendation performance.


INTRODUCTION
With the improvement of people's living standards, tourism has become an important form of leisure and entertainment. According to statistics from recent years, the number of tourists and tourism income are both growing at a rate of more than 10%. To attract more tourists, tourism companies need to understand the needs of tourists and develop a variety of attractive routes. However, it is difficult for tourists to choose suitable routes from such a large number of options. Therefore, capturing users' travel needs in order to recommend tourist routes has become a problem to be solved in the tourism industry.
Recommender systems [1] are the main way to solve this problem. However, because of the high sparsity, seasonality and temporal nature of tourism data, some users have few or even no interactions in some sessions if their histories are divided into the same number of sessions with a fixed time window. That not only aggravates the sparsity but also makes many session-based methods unsuitable for direct use in route recommendation. Many recommendation algorithms and systems for tourists have emerged. For example, He et al. extracted hidden themes from documents with LDA and determined the most suitable route for a user from scores generated for each user-route pair by a collaborative filtering algorithm [9]. A Hadoop-based intelligent tourist route recommendation system used distributed association rule computation to address secure storage of and fast access to large amounts of data [10]. However, these works do not address the unique characteristics of tourism data.
In this paper, we propose a route recommendation model based on dynamically dividing each user's interaction history. The model adapts to the high sparsity and the seasonal and temporal patterns of tourism data, and it models user preference drift over time to improve the accuracy of route recommendation. Firstly, the high sparsity, temporal, price and age characteristics of tourism data are analyzed. Secondly, every tourist's history is divided into stages based on the temporal characteristics of tourism data, and the tourist's evolving preference is modeled by using LDA to extract, for every stage, the probabilistic topic model that represents the user's latent interest, and by defining a time-decreasing weight. Then, the tourist's feature vector is built from the tourist's age, travel season and route price to obtain a set of neighbors and candidate routes for an active tourist. Finally, routes are recommended based on the relevance between the probabilistic topic distributions of the candidate routes and of the active tourist. Extensive experiments on a real tourism data set show that the method makes effective use of the characteristics of tourism data and recommends routes accurately.
The major contributions of this paper are as follows:
• We demonstrate the high sparsity and temporal characteristics of tourism data. The analysis shows that it is unreasonable to divide tourism history into the same number of stages with a fixed time window, because doing so worsens the sparsity issue.
• We propose a novel method that dynamically divides every user's interaction history into a variable number of stages based on the tourism data. The user's preference drift over time is modeled by applying LDA to each session together with a time-sensitive weighting scheme that captures the user's evolving interest.
• We employ LDA as the language model to detect the probabilistic topic distribution of each tourist in every session, which represents the latent factors affecting the tourist's choice of routes. This makes it easy to mine each tourist's preference changes over topics and to predict the tourist's preference trend over routes, which is important for building the profile.
• We take temporal context, tourist age, route price and travel season into account when selecting the neighbors used to recommend routes to the active tourist, which suits the characteristics of real tourism data.
• We conduct a large set of experiments to evaluate the performance of our method and compare it with other state-of-the-art methods.
The remainder of this paper is organized as follows. In section 2, we provide a brief overview of related work. Section 3 analyzes the sparsity and temporal characteristics of tourism data, proposes a novel method to divide the tourism history into sessions, and models the user's preference on this temporal division using LDA and a time-sensitive weighting scheme. The method of neighbor selection and route recommendation is introduced in section 4. The results of an empirical analysis are presented in section 5, followed by a conclusion in section 6.

RELATED WORKS
In this section, we briefly review tourism recommendation systems, LDA-based recommendation and session-based temporal dynamic recommendation.

Tourism Recommendation
Recommendation can cover the whole process of tourism: routes can be recommended before the trip, personalized services can be offered during it by combining mobile devices with context-aware information, and feedback can be collected after the trip ends. Zhu et al. modeled users in both geographical space and semantic space, defined an Activity Pattern, and extracted routes matching an individual's activity patterns from the trajectories of highly similar users to recommend top-k routes to a user [11]. Based on Web GIS technology, Liu et al. designed a novel personalized smart system that focuses on reaching as many destinations as possible within a specified time period at the least travel cost [12]. Hasuike et al. introduced a Time-Expanded Network (TEN) to handle randomly changing traveling and sightseeing times, and selected the next sightseeing site through conditional probabilities calculated from current conditions, statistical data and Web data [13]. Shen, Tong and Chen developed a two-step greedy-based heuristic algorithm to conduct strategic multiple-event planning for every user and considered the constraints of spatio-temporal conflicts and travel expenditure to address the data sparsity problem in destination prediction [14]. Xue et al. proposed a sub-trajectory synthesis method that decomposes historical trajectories into sub-trajectories comprising two adjacent locations and connects them into "synthesized" trajectories [15,16]. Su et al. presented the CrowdPlanner system, which asks human workers to evaluate candidate routes recommended by different sources and methods and determines the best route based on their feedback [17]. Devasanthiya et al. described a recommender system that obtains textual messages, classifies them using Rocchio classification and yields recommendation results using ontological specifications, to help travel agents recommend tourism options to customers [18]. In [19,20], the authors used ontology-based semantic similarity measures to evaluate the semantic similarity between items or to recommend personalized information.
In [21], the authors propose methods that provide passengers with useful information, namely the probability of getting a taxi and the average waiting time, to facilitate taxi-taking. In [22], Shen, Deng and Gao designed a personalized attraction similarity model that suggests attractions by leveraging explicit user interaction and heterogeneous multi-modality travel information, and refined the recommendation by adapting to context information such as the user's location at a particular moment. Zheng explored the feasibility of promoting circuitous tourism through the recommendation of highly acclaimed tour routes [23]. [26]. In [27], a plot-based recommendation system was proposed that evaluates the similarity between the plot of a video watched by a user and a large number of plots stored in a movie database; it implemented and compared two topic models, Latent Semantic Analysis (LSA) and LDA.

Combining Session-based Temporal Dynamic CF with LDA-based Recommendation
Session-based temporal collaborative filtering models a user profile by dividing the user's interaction history into stages. Yu and Zhu introduced an enhanced session-based temporal graph model that considers three features to capture personal and temporal user interest, and then recommended personalized hashtags by combining long-term and short-term user interest [4]. In [28], the authors presented incremental session-based collaborative filtering with a forgetting mechanism for music recommendation systems, which considers music listened to continuously and maintains the recent sessions. Ricardo et al. defined the temporal information and the diversity of sessions and performed music recommendation using session-based collaborative filtering [29]. Xiang constructed a temporal graph to model the user's long-term and short-term preferences simultaneously [5]. Zheleva et al. used LDA to build a hierarchical graph after dividing a user's interaction history into sessions [6]. Li et al. divided the user's interaction history into stages with a fixed time window and recommended news groups using the long-term preference and specific news items using the short-term preference [7]. Hong et al. categorized the items to establish the user's long-term preference, identified the user's current stage and provided the recommended list [8].

MODELING THE TOURIST PREFERENCE DRIFT OVER TIME
In this section, we first analyze the characteristics of tourism data in section A. In section B, based on the identified characteristics, we dynamically divide each tourist's interaction history into sessions, which alleviates the high sparsity of tourism data.

Characteristics of Tourism Data
Compared with other data sets, tourism data have several distinctive characteristics: higher sparsity, temporal features of travel, and statistical characteristics of tourist age and route price.

High Sparsity
Tourism data are sparser than other standard data sets because a person travels only a limited number of times, whereas shopping or watching movies is very frequent. We use the percentage of tourists who have travelled a given number of times to show this sparsity. It is defined as

$$P(N_i) = \frac{N_i^{num}}{T^{num}} \times 100\% \tag{1}$$

where $N_i^{num}$ represents the number of tourists who have travelled $N_i$ times and $T^{num}$ represents the total number of tourists. The sparser the data, the smaller this percentage becomes as $N_i$ increases, so most tourists are concentrated at small values of $N_i$.
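As a minimal sketch of how the sparsity percentage in (1) could be computed, the snippet below counts trips per tourist and then the share of tourists at each trip count. The function name `travel_count_percentages` and the input layout (one tourist id per travel record) are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter

def travel_count_percentages(records):
    """Share of tourists (in percent) for every observed number of trips.

    records: iterable of tourist ids, one entry per travel record
             (hypothetical input layout).
    """
    trips_per_tourist = Counter(records)                      # tourist -> N_i
    tourists_per_count = Counter(trips_per_tourist.values())  # N_i -> number of tourists
    total_tourists = len(trips_per_tourist)
    return {n: 100.0 * count / total_tourists
            for n, count in sorted(tourists_per_count.items())}

percentages = travel_count_percentages(["a", "a", "b", "c", "c", "c", "d"])
```

For sparse data, the returned percentages should be large for small trip counts and fall off quickly.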

Temporal Features of Tourism
Tourism is an important form of leisure and entertainment and is easily influenced by the season and by available leisure time. Assume that tourist $u_i$ has entertainment time $e_i$ and that a year is divided by month into stages $s_i$ ($1 \le i \le 12$). The probability that a route $R_i$ in stage $s_i$ is selected by $u_i$, and the probability that $u_i$ travels in $s_i$, are defined by (2) and (3), respectively, where $corr(e_i \mid s_i)$ represents the correlation between the leisure and entertainment time $e_i$ and the stage $s_i$ to which route $R_i$ belongs, and $a$ is a number that is either close to 0 or close to 1. This expresses that tourists are more willing to travel during their leisure and entertainment time and that this time is relatively fixed from year to year.

Statistics on the Age of Tourists and the Price of Routes
The age distribution of tourists is related to whether they have spare time or strong economic capacity. We divide the age of tourists into 6 sessions; the distribution of each age session is defined by (4), where $e_{u_i}$ and $c_{u_i}$ represent the entertainment time and the economic capacity of tourist $u_i$, respectively, and $age_j$ is the age session to which tourist $u_i$ belongs. Unlike movies or shopping, where the price is independent of time, the price of a route is often associated with the length of the trip: the longer the trip, the higher the price. We split the prices of all routes into 7 sections, and the influence of each price section is defined by

$$f_k = \frac{num_{R_k}}{num_{Total}} \tag{5}$$

where $num_{R_k}$ is the number of routes whose price falls in section $k$ and $num_{Total}$ is the total number of routes. Therefore, we should take the characteristics of tourism data into account when modeling the user preference, in order to design a suitable recommendation algorithm and obtain better recommendation results.

Modeling the Tourist Preference Drift over Time

Division of Tourist Interaction History
Session-based CF divides the user interaction history into sessions with a fixed time window, and the user profile is expressed as a sequence of these stages. However, fixed-window division is not suitable for tourism data because of their high sparsity and the relatively fixed time slots in which each tourist travels: some stages would contain no tourism behavior at all. We therefore dynamically divide the tourist interaction history into stages with a variable time window based on the actual interactions of each tourist. Formally, we define a set of tourists $U = \{u_1, u_2, \ldots, u_m\}$ and a set of routes $L = \{l_1, l_2, \ldots, l_n\}$; the interaction history of tourist $u_i$ is $H_{u_i}$. First, we set the smallest temporal domain, of size $\delta$, based on the average time interval between the records of tourist $u_i$. Next, we select any record as a center point and compute the distance between it and the other records. If the distance is greater than $\delta$, the other record is treated as a new center; otherwise, the two records are merged into one cluster and their center becomes the new center point. The distances between the new center point and the remaining records are then recalculated, and the process continues until the results for any two adjacent points no longer change. Based on this process, the interaction history $H_{u_i}$ is divided into

$$H_{u_i} = \{H_{u_i}^{1}, H_{u_i}^{2}, \ldots, H_{u_i}^{|H_{u_i}|}\}$$

where $|H_{u_i}|$ is the number of center points, $H_{u_i}^{t}$ ($1 \le t \le |H_{u_i}|$) is the set of records of tourist $u_i$ belonging to the $t$-th cluster, and $H_{u_i}^{|H_{u_i}|}$ is the latest one. The interaction history of $u_i$ is thus dynamically divided into stages according to how his or her own tourism records merge. We believe the proposed strategy can be applied not only to our data set but also to other tourism data sets, and even to other types of data, as long as the interaction history is segmented and the data are sparse.
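The division procedure can be sketched as follows. This is a simplified one-dimensional variant, not the authors' exact algorithm: travel dates are assumed to be given as day numbers, records are processed in chronological order, $\delta$ defaults to the tourist's mean gap between consecutive trips, and a record joins the current session when its distance to that session's center (the mean date) is at most $\delta$. The function name `divide_history` is hypothetical.

```python
from statistics import mean

def divide_history(days, delta=None):
    """Split one tourist's travel dates into variable-length sessions.

    days:  travel dates as day numbers (assumed representation).
    delta: merge threshold; defaults to the mean gap between consecutive
           trips of this tourist (one reading of "average time interval").
    """
    days = sorted(days)
    if not days:
        return []
    if delta is None:
        gaps = [later - earlier for earlier, later in zip(days, days[1:])]
        delta = mean(gaps) if gaps else 0
    sessions = [[days[0]]]
    for day in days[1:]:
        center = mean(sessions[-1])      # center point of the current session
        if day - center > delta:
            sessions.append([day])       # too far away: start a new session
        else:
            sessions[-1].append(day)     # close enough: merge into the session
    return sessions

sessions = divide_history([0, 10, 20, 200, 210, 400])
```

Because the threshold is derived per tourist, a frequent traveller and an occasional traveller end up with different numbers of sessions, and no session is empty.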

Generation of Probabilistic Topic Distribution Based on LDA
LDA is employed as the language model in the recommendation. The general framework has a natural interpretation when dealing with users' preference data: the set of users defines the corpus, each user is considered a document, the items purchased are considered words, and the ratings are considered word frequencies. In this paper, the tourist records $H_{u_i}^{t}$ of tourist $u_i$ have a corresponding detailed tourism document $L_{u_i}^{t}$. LDA regards every document as a distribution over a group of topics, which gives the probabilistic topic distribution of each tourist, and each topic is a distribution over the words describing some route. The tourism document is therefore first preprocessed by removing stop words and low-frequency words and is segmented using forward-iteration fine-grained segmentation. We can then obtain the multinomial topic-word and topic-document distributions $\varphi_{jk}$ and $\theta_{kl}$ using LDA, which are given by (6) and (7), respectively, where $|L_j|$ is the number of words in a document $L_j$, $n_{jk}$ is the number of times that a word $w_j$ is assigned to $T_k$, $T_k$ is the $k$-th component of the topic vector $T = \{T_1, T_2, \ldots, T_K\}$, $K$ is the number of topics, $m_{kl}$ is the number of times that a document $L_l$ is assigned to $T_k$, and $\alpha$ and $\beta$ are the hyperparameters of $\theta$ and $\varphi$, respectively. In practice, the default values of $\alpha$ and $\beta$ are often set to $50/K$ and $0.01$. We thus obtain the probabilistic topic distribution of tourist $u_i$ in the $t$-th stage, $P_{u_i}^{t} = (p_{u_i,1}^{t}, p_{u_i,2}^{t}, \ldots, p_{u_i,K}^{t})^{T}$. The probabilistic topic distribution of the whole tourism history of tourist $u_i$ is expressed as $P_{u_i} = (P_{u_i}^{1}, P_{u_i}^{2}, \ldots, P_{u_i}^{|H_{u_i}|})$, where $|H_{u_i}|$ is the number of stages.
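To make the estimation step concrete, here is a small self-contained sketch of LDA fitted by collapsed Gibbs sampling. It is a generic textbook formulation written for illustration and may differ in detail from the authors' equations (6) and (7). The function name `lda_gibbs`, the iteration count and the toy documents are assumptions; the defaults $\alpha = 50/K$ and $\beta = 0.01$ follow the values quoted in the text.

```python
import random

def lda_gibbs(docs, K, alpha=None, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs: list of token lists, one per session document.
    Returns (theta, phi, vocab): document-topic and topic-word distributions.
    """
    rng = random.Random(seed)
    alpha = 50.0 / K if alpha is None else alpha
    vocab = sorted({w for doc in docs for w in doc})
    wid = {w: j for j, w in enumerate(vocab)}
    W = len(vocab)
    n_kw = [[0] * W for _ in range(K)]   # times word j is assigned to topic k
    n_dk = [[0] * K for _ in docs]       # tokens of document d assigned to topic k
    n_k = [0] * K                        # total tokens assigned to topic k
    z = [[rng.randrange(K) for _ in doc] for doc in docs]  # random initial topics
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_kw[k][wid[w]] += 1
            n_dk[d][k] += 1
            n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j, k = wid[w], z[d][i]
                # Remove the token's current assignment from the counts.
                n_kw[k][j] -= 1
                n_dk[d][k] -= 1
                n_k[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [(n_kw[t][j] + beta) / (n_k[t] + W * beta) * (n_dk[d][t] + alpha)
                           for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                n_kw[k][j] += 1
                n_dk[d][k] += 1
                n_k[k] += 1
    theta = [[(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
             for d, doc in enumerate(docs)]
    phi = [[(n_kw[k][j] + beta) / (n_k[k] + W * beta) for j in range(W)]
           for k in range(K)]
    return theta, phi, vocab

# Toy session documents (hypothetical route descriptions, already tokenized).
docs = [["beach", "sea", "island"], ["mountain", "hike", "snow"], ["beach", "sea", "sea"]]
theta, phi, vocab = lda_gibbs(docs, K=2, iters=50)
```

Each row of `theta` plays the role of a stage-level topic distribution $P_{u_i}^{t}$. In practice a library implementation would normally be preferred over a hand-written sampler.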

(3) Tourist Preference Drift over Time
The probabilistic topic distribution of a later stage is more important than those of earlier stages because later preferences are more consistent with the current interest. The probabilistic topic distributions of different stages therefore receive different weights when predicting the tourist's current interest. The preference drift over time is defined as (8), where $\lambda_{u_i}^{t}$ is the time-decreasing weight of tourist $u_i$ in the $t$-th stage, defined by (9). It can be seen from (9) that $\lambda_{u_i}^{t} \in [0, 1]$ and that a larger $t$ represents a later stage of the history. The higher the value of $\lambda_{u_i}^{t}$, the greater the contribution of the corresponding probabilistic topic distribution when modeling the tourist preference drift over time.
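The weighting idea can be illustrated with a short sketch. The exact forms of (8) and (9) are not reproduced here, so the code assumes an exponential decay $\lambda^{t} = e^{-r(T - t)}$ (which lies in $(0, 1]$ and grows with $t$, matching the stated properties) and a weighted sum of the stage distributions that is renormalized to sum to 1. The decay rate `rate` and both function names are hypothetical.

```python
import math

def decay_weights(num_stages, rate=0.5):
    """Assumed time-decreasing weights: later stages get weights closer to 1."""
    return [math.exp(-rate * (num_stages - t)) for t in range(1, num_stages + 1)]

def predicted_preference(stage_topics, rate=0.5):
    """Combine per-stage topic distributions into one predicted distribution.

    stage_topics: list of K-dimensional topic distributions, oldest stage first.
    """
    weights = decay_weights(len(stage_topics), rate)
    num_topics = len(stage_topics[0])
    mixed = [sum(w * stage[k] for w, stage in zip(weights, stage_topics))
             for k in range(num_topics)]
    total = sum(mixed)
    return [value / total for value in mixed]   # renormalize (an assumption)

# A tourist whose interest moved from topic 0 to topic 1.
preference = predicted_preference([[1.0, 0.0], [0.0, 1.0]])
```

With these assumptions, the most recent stage dominates the predicted preference while older stages still contribute.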

TOURISM ROUTES RECOMMENDATION
In this section, we first analyze the difficulty of neighbor selection in tourism data and propose a method of neighbor selection based on the tourist feature vector. Then, we obtain the candidate set of routes based on the temporal features and recommend routes to the active tourist.

Selection of Tourist Neighbor
Due to the high sparsity of tourist data, tourists have very few routes in common. Fig. 1 shows the percentage of tourists who have common routes. More than 95% of tourists share fewer than 3 common routes, and the proportion is slightly higher when the common routes fall within one month. The number of common routes among all tourists is therefore very small, fewer than 5, which makes it difficult to select nearest neighbors whose common routes are relatively close in time. We therefore use the characteristics of tourism data when recommending routes. Based on the detailed analysis of these characteristics and the segmentation of the attributes in section 3, the travel time of tourists is divided into four periods (spring, summer, autumn and winter), the age of tourists is divided into six stages, and the price of routes is divided into seven sections. We establish a $d$-dimensional feature vector for each tourist. The similarities between tourists are then calculated from these feature vectors as defined in (10).
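One possible realization of the feature vector and neighbor selection is sketched below. The layout (4 season slots, 6 age slots, 7 price slots, so $d = 17$), the bin edges in `AGE_EDGES` and `PRICE_EDGES`, and the use of cosine similarity in place of (10) are all assumptions made for illustration; the paper does not give these details in the text above.

```python
import math
from bisect import bisect_right

AGE_EDGES = [19, 26, 36, 51, 66]                    # hypothetical edges: 6 age groups
PRICE_EDGES = [500, 1000, 1500, 2000, 2500, 3000]   # hypothetical edges: 7 price sections

def feature_vector(age, trips):
    """Build a 17-dimensional vector: 4 season + 6 age + 7 price slots.

    trips: list of (month, price) pairs for one tourist.
    """
    vec = [0.0] * 17
    vec[4 + bisect_right(AGE_EDGES, age)] = 1.0
    for month, price in trips:
        # (month % 12) // 3 maps Dec-Feb to 0, Mar-May to 1, Jun-Aug to 2, Sep-Nov to 3.
        vec[(month % 12) // 3] += 1.0 / len(trips)
        vec[10 + bisect_right(PRICE_EDGES, price)] += 1.0 / len(trips)
    return vec

def cosine(p, q):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return sum(x * y for x, y in zip(p, q)) / norm if norm else 0.0

def nearest_neighbors(active_vec, others, n):
    """others: {tourist_id: feature vector}. Returns the n most similar ids."""
    ranked = sorted(others, key=lambda uid: cosine(active_vec, others[uid]), reverse=True)
    return ranked[:n]

active = feature_vector(30, [(4, 400), (5, 450)])
others = {"u2": feature_vector(32, [(4, 480)]), "u3": feature_vector(60, [(12, 3500)])}
neighbors = nearest_neighbors(active, others, n=1)
```

Because the vector depends only on age, season and price, two tourists can be neighbors even when they have no route in common, which is the point of the feature-based selection.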

Generation of Candidate Routes
From the analysis in section 3, tourists are more willing to travel in relatively specific months each year, and from section 4, the times of common routes are relatively close. Based on these analyses, after dividing a year into 12 stages by month, we define a three-tuple that records the route set $L_{u_i}^{m_f}$ of tourist $u_i$ in month $m_f$. The sets of temporal neighbors and candidate routes are obtained from (11) and (12), respectively, where $N_{u_i}$ is the set of neighbors obtained from (10).
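A hedged sketch of candidate generation follows. Since (11) and (12) are not reproduced above, the code assumes that temporal neighbors are the neighbors who travelled in the target month and that candidates are the routes those neighbors took in that month, excluding routes the active tourist has already taken. The function name and the dictionary layout are hypothetical.

```python
def candidate_routes(active, month, neighbors, routes_by_month):
    """Select temporal neighbors and candidate routes for one target month.

    routes_by_month: {(tourist_id, month): set of route ids} (assumed layout).
    """
    temporal_neighbors = [v for v in neighbors if routes_by_month.get((v, month))]
    candidates = set()
    for v in temporal_neighbors:
        candidates |= routes_by_month[(v, month)]
    # Excluding already-visited routes is an assumption, not stated in the paper.
    visited = set()
    for (tourist, _), routes in routes_by_month.items():
        if tourist == active:
            visited |= routes
    return temporal_neighbors, candidates - visited

history = {
    ("u1", 5): {"r1"},
    ("u2", 5): {"r1", "r2"},
    ("u3", 10): {"r3"},
    ("u4", 5): {"r4"},
}
temporal, candidates = candidate_routes("u1", 5, ["u2", "u3", "u4"], history)
```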

Tourism Routes Recommendation
The probabilistic topic distribution $P_{L_l}$ of each route in the candidate set $S_{u_i}$ of tourist $u_i$ is obtained using LDA, and the preference drift over time of tourist $u_i$ in stage $|H_{u_i}| + 1$, denoted $P_{u_i}^{|H_{u_i}|+1}$, is calculated using (8). Both are $K$-dimensional vectors in the probabilistic topic space. The similarity measures commonly used in recommender systems are cosine similarity, adjusted cosine similarity and Pearson correlation-based similarity. Cosine similarity measures the similarity between two vectors by the cosine of the angle between them. We therefore compute the similarity between $P_{L_l}$ and $P_{u_i}^{|H_{u_i}|+1}$ using cosine similarity to obtain the degree of correlation between the user's predicted preference and the route:

$$sim\left(P_{L_l}, P_{u_i}^{|H_{u_i}|+1}\right) = \frac{P_{L_l} \cdot P_{u_i}^{|H_{u_i}|+1}}{\left\|P_{L_l}\right\| \left\|P_{u_i}^{|H_{u_i}|+1}\right\|} \tag{13}$$

where $L_l \in S_{u_i}$. We rank the similarities and recommend the top-k routes to the active tourist.
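The ranking step can be sketched in a few lines: compute the cosine similarity between the predicted preference vector and the topic distribution of every candidate route, then keep the k best. The function names and the toy topic vectors are illustrative assumptions.

```python
import math

def cosine_similarity(p, q):
    """Cosine of the angle between two topic distributions."""
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return sum(x * y for x, y in zip(p, q)) / norm if norm else 0.0

def recommend_top_k(user_pref, route_topics, k):
    """route_topics: {route_id: topic distribution}. Returns the k best route ids."""
    ranked = sorted(route_topics,
                    key=lambda rid: cosine_similarity(user_pref, route_topics[rid]),
                    reverse=True)
    return ranked[:k]

# Toy 3-topic example (hypothetical values).
user_pref = [0.7, 0.2, 0.1]
route_topics = {"A": [0.8, 0.1, 0.1], "B": [0.1, 0.8, 0.1], "C": [0.6, 0.3, 0.1]}
top_routes = recommend_top_k(user_pref, route_topics, k=2)
```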

EXPERIMENTS AND RESULTS
In this section, we conduct a comprehensive experimental evaluation to show the effectiveness of the proposed method. First, we introduce the tourism data set used in this paper and the performance metric, which we define based on the characteristics of tourism data. Then, we present the experimental results, including the selection of key parameters, model validation and comparison studies.

Dataset
The real tourism data come from a tourism company belonging to Xiamen Airlines. The data set contains 732019 travel records from January 2009 to October 2014. In this paper, we extract 25717 records from 4737 tourists on 1436 routes. Tourists with fewer than 3 records are removed. Currently, the data set is not published on the Internet in order to protect the privacy of tourists.

Metric
Precision, Recall and F-score are common metrics for evaluating the performance of top-k recommendation. In the experiments, we treat the first $|H_{u_i}| - 1$ sessions of each tourist as the training set and the last session, $H_{u_i}^{|H_{u_i}|}$, as the test set. Because of the high sparsity, the number of correctly recommended routes is nearly always 0 or 1, so the precision is almost always 0 or $1/k$ and the recall is 0 or 1, which makes these metrics uninformative for evaluating the recommendation results. In this paper, we therefore propose precision coverage as the evaluation metric, which is defined as (14), where $|U|$ is the number of tourists and $\rho_{u_i}$ is defined by (15) in terms of the set of top-k routes recommended to the active tourist $u_i$.
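One plausible reading of precision coverage is a hit rate: the fraction of tourists for whom at least one of the top-k recommended routes appears in their held-out session. The sketch below implements that reading as an assumption; the exact definitions in (14) and (15) may differ, and the function name and input layout are hypothetical.

```python
def precision_coverage(recommended, held_out):
    """Fraction of tourists with at least one correct route in their top-k list.

    recommended: {tourist_id: list of top-k route ids}
    held_out:    {tourist_id: iterable of route ids in the test session}
    """
    if not recommended:
        return 0.0
    hits = sum(1 for tourist, routes in recommended.items()
               if set(routes) & set(held_out.get(tourist, ())))
    return hits / len(recommended)

score = precision_coverage(
    {"u1": ["a", "b"], "u2": ["c"], "u3": ["d"], "u4": ["e"]},
    {"u1": ["b"], "u2": ["x"], "u3": ["d"], "u4": []},
)
```

Unlike per-user precision, which can only be 0 or about $1/k$ on data this sparse, this aggregate varies smoothly across methods and parameter settings.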

Analysis of Tourism Data
We compare the tourism data set with a standard movie recommendation data set (MovieLens). To make the comparison clearer, the number of ratings is taken as 10 times the number of trips. The result is shown in Fig. 2. The percentage of tourists declines rapidly as the number of trips increases, and more than 95% of tourists have travelled fewer than 10 times. On MovieLens the percentage also decreases as the number of watched movies increases, but significantly more slowly than for the tourism data, and the gap becomes more obvious as the counts grow. Fig. 3 gives statistics on the months in which all tourists travel. It shows that tourists are more willing to travel in spring and autumn, when the climate is pleasant. The distribution of travel time is even more concentrated for each individual tourist. Fig. 4 shows the month distribution of tourists: the travel time of more than 70% of tourists is concentrated in 4 months, which shows that each tourist travels at a relatively fixed time each year. The time factor should therefore be considered when recommending routes to users. Figure 5 shows that most tourists fall in the 1-18, 26-35 and 36-50 age groups, which together exceed 70% of all tourists. This is most likely because students aged 1-18 have a lot of spare time, such as winter and summer vacations, in which to travel with their parents or companions, while the 26-35 and 36-50 age groups have strong economic capacity and treat tourism as a form of leisure and entertainment.
The prices of the routes selected by tourists are shown in Fig. 6. The number of tourists falls rapidly as the price of routes increases. About 70% of tourists choose routes priced below 500. The proportion of tourists who select routes priced between 500 and 2000 is flat, while the percentages of tourists who select routes priced at 3000 or more are lower than the others and very close to each other. We can therefore see that people prefer routes that are cheap and short. The influence of price becomes smaller once the price reaches a certain value.

(2) Selection of the Number of Topics
The optimal number of topics $K$ used in LDA-based recommendation is usually not learned from the data but predefined, because the topics are latent variables. We use precision coverage as the criterion for different values of $K$ and also record the running time, because accuracy and computational complexity are the two criteria we use to decide the number of topics. More topics mean more computation and therefore a higher recommendation complexity. The results are shown in Fig. 7. The precision coverage first increases and then decreases as the number of topics grows, while the running time always increases. The value that gives the best trade-off between precision coverage and computational complexity is 50, which we select as the optimal number of topics.

(3) Selection of the Number of Neighbors n
In all neighborhood-based methods, the number of similar neighbors $n$ is very important. We calculate precision coverage for the active user with different values of $n$, as shown in Fig. 8. Precision coverage first increases and then decreases as the number of neighbors grows. When the neighbors are few, there are few common routes, which leads to a small candidate route set. When the number of neighbors is too large, the similarity between the active user and the neighbors becomes poor, which leads to a larger difference between the candidate routes and the actual route of the active tourist. In the following experiments, we therefore use 50 as the optimal number of neighbors for our approach.

Comparison with Other Methods
To evaluate the effectiveness of our method, we compare it (called TLDA) with several representative baselines: 1) UCF (user-based collaborative filtering) [30]: a representative user-based collaborative filtering method; 2) LDA [1]: a method that models the user preference using LDA and recommends items to the active user based on the user profile; 3) ItemRank [31]: a method that builds an association graph of routes and ranks them using a random walk. In the experiment, the LDA parameters are $\alpha = 50/K$ and $\beta = 0.01$, the restart probability is 0.15, and the other parameters take the optimal values for each method. The comparison results are shown in Fig. 9. We observe the following. (1) TLDA, LDA and ItemRank outperform UCF. The LDA-based methods (TLDA and LDA) assume that users' interests are influenced by latent factors, which reflect and model the interaction relationships between users and items more deeply and thus better capture the user preference. ItemRank increases recommendation diversity through the random walk, which prevents over-filtering. UCF, in contrast, relies only on rating information, which is less informative. (2) LDA and ItemRank perform similarly, with LDA slightly superior. This is expected: LDA characterizes the user preference using latent factors and recommends items based on it, which better predicts the trend of the user profile, whereas ItemRank uses only the associations between items. (3) TLDA performs better than all other methods. The reason is that TLDA not only models the user preference drift over time using LDA but also takes temporal context, tourist age, route price and travel season into account when selecting neighbors for the active user.

CONCLUSIONS
In this paper, we propose a novel method that automatically divides a tourist's interaction history into sessions of variable size based on his or her actual travel data. LDA is then used to model the tourist's preference in each session, and time-decreasing weights measure the importance of each session in order to capture the dynamic preference. This makes it easy to mine each tourist's preference changes over topics and to predict the tourist's preference trend over routes. We take temporal context, tourist age, route price and travel season into account when selecting the neighbors used for recommendation. The proposed method not only mines and adapts to the high sparsity and the temporal, seasonal and price characteristics of tourism data, but also alleviates the sparsity through the dynamic division of each tourist's interaction history into sessions. The experimental results on a real tourism data set show that our method can both uncover the tourist preference drift over time and achieve better recommendation performance.