Predicting Human Mobility via Long Short-Term Patterns

Predicting human mobility is of great significance in Location-Based Social Network (LBSN) applications, but it is challenging because future movement depends on both historical mobility patterns and the current trajectory. Historical patterns are crucial to the prediction task, yet complex patterns are difficult to capture from long historical trajectories. Motivated by the recent success of Convolutional Neural Network (CNN)-based methods, we propose the Union ConvGRU (UCG) Net, which captures the long short-term patterns of historical trajectories and the sequential impact of current trajectories. Specifically, we first map historical trajectories into hidden states with a shared-weight layer, and then use a 1D CNN to capture the short-term pattern of these hidden states. Next, average pooling generates separate hidden states for the sub-trajectories of the history, on which a Fully Connected (FC) layer captures the long-term pattern. Finally, a Recurrent Neural Network (RNN) predicts future trajectories by integrating the current trajectory with the long short-term patterns. Experiments demonstrate that UCG Net outperforms the neural network-based methods it is compared with.


Introduction
With the rapid development of social networks and location-based services, Location-Based Social Networks (LBSNs) such as Foursquare and Twitter are increasingly popular in daily life. Many users post short texts and share their footprints (we refer to such footprints as check-in points) on these platforms, which has enabled research efforts [1][2][3][4] on collecting users' trajectories and learning their mobility patterns. A critical task in understanding such patterns is next-location prediction, which has been extensively studied [5][6][7] in recent years. This prediction task also has great social significance in many applications, including route planning for taxi drivers [8] and crowd dispersal in public transport [9]. Grasping human movements in advance helps such applications solve these problems.
neural network-based models. For example, RNN-based methods are commonly used. Based on historical trajectories, these works capture the sequential regularities of users' mobility preferences. Specifically, they train an RNN module to extract mobility features (such as spatial and temporal features) and predict the next location.
Although the above methods achieve encouraging results on human mobility prediction, several key problems remain: (1) Sparsity of data. On the one hand, some users generate only a few check-ins. On the other hand, the observable features of users' historical visits are sparse. (2) Complex mobility patterns in historical trajectories. The trajectories collected from LBSNs are often very long, and existing works struggle to exploit whole trajectories effectively. (3) Heterogeneity of data. Each user's mobility pattern affects that user's future check-ins individually, and it is difficult to capture the patterns of multiple users at the same time. Together, these problems make the prediction difficult.
To address the above problems, we propose UCG Net to predict the next locations of users. We first collect the context-aware check-in points of each user and generate the user's whole trajectory, and then map the whole trajectory into a dense representation with an embedding technique, which addresses the sparsity of data. Next, we divide the whole trajectory into a historical trajectory and a current trajectory. We then introduce a historical trajectory learning module to capture the long short-term pattern of the historical trajectory. Specifically, a CNN architecture captures the short-term pattern of the trajectory's hidden states; an average pooling method and an FC layer capture the long-term pattern of the separated hidden states; and the two patterns are concatenated into a vector that reflects the long short-term pattern of the historical trajectory. We also introduce a current trajectory learning module, which captures the sequential transitions of the current trajectory together with this long short-term pattern. Finally, an FC layer and a Softmax layer predict the user's next location. In summary, our main contributions are:
1. We propose UCG Net to model trajectories and predict human mobility. It consists of historical and current trajectory learning modules. The historical trajectory learning module uses a CNN-based architecture to process historical trajectory sequences and capture long short-term patterns; the current trajectory learning module uses an RNN to capture the sequential transitions of the current trajectory.
2. We use a 1D CNN to capture the short-term pattern of the historical trajectory, and then use an FC layer to capture the long-term pattern of the sub-trajectory representations generated by average pooling. The resulting long short-term patterns serve as the RNN's context, so that UCG Net considers the dual impact of the historical and current trajectories.
3. We conduct experiments on real-world datasets covering different check-in behavior in two cities, demonstrating that our model improves prediction performance compared with existing strong deep neural network-based models.
The rest of this paper is organized as follows. We first review the related works in Section 2, then formulate our problem and discuss the motivation of our work in Section 3. Following that, we detail the architecture of UCG Net in Section 4. After that, we introduce the experimental configuration and discuss the experimental results in Sections 5 and 6, respectively. Finally, we conclude our paper in Section 7.

Related Work
Human mobility prediction has drawn great attention for decades. Existing research efforts can be categorized into conventional and neural network-based methods. Conventional methods focus on recognizing mobility patterns and exploring users' preferences. For example, the popular Matrix Factorization (MF) [10] and MF-based methods [17,18] in POI recommendation learn users' preferences by building a matrix from user location history. T-pattern [19] and Gapped Spatiotemporal-Periodic (GSTP) [20] recognize mobility patterns by mining spatiotemporal trajectories. These methods fail to capture the sequential information of a trajectory and are therefore not suitable for predicting users' next visits. Subsequently, many research efforts addressed this problem with Markov Chain (MC)-based methods [11][12][13][21]. MC-based methods can also be regarded as conventional methods. They learn sequential transition regularities by building a transition matrix from historical observations. For example, Mathew et al. [21] cluster locations from trajectories and train a Hidden Markov Model to capture unobserved characteristics. These methods are less effective because they ignore time dependence. Compared with the above conventional methods, our model is able to capture time dependence and complex transitions.
Recently, neural network-based methods [14][15][16][22][23][24][25][26][27][28] have been widely used to predict human mobility, especially RNNs such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). For example, Liu et al. [14] propose a Spatial Temporal RNN (ST-RNN), which models spatial and temporal contexts to predict next locations. To address the lack of semantic information, Yao et al. [15] propose a semantic-aware recurrent model (SERM), which considers both sequential and semantic influence to predict the future trajectory. However, these two methods cannot deeply capture the sequential information of the historical trajectory. Therefore, Feng et al. [16] propose the DeepMove model, which captures the multi-level periodicity of the historical trajectory with an attention mechanism based on a seq2seq model. Since DeepMove is complicated to train, Gao et al. [25] more recently combine a generative probabilistic model with a neural network and propose VANext to improve on it. They are the first to incorporate a CNN to capture long-term dependencies among human trajectories, and they improve learning efficiency through GPU parallelization. Furthermore, they use a variational attention mechanism to learn the attention over the historical trajectories. These two attention-based methods are effective at matching the current trajectory against a long historical trajectory. However, they mainly compute the contribution of each historical visit and therefore lose the global sequential impact of the historical trajectory. Compared with these neural network-based methods, our model predicts human mobility from a global perspective and can learn the long short-term mobility pattern of the historical trajectory.
CNNs are not the most popular choice for sequence modeling, but they have received increasing attention for this problem. For example, Kim et al. [29] use a CNN architecture to capture the local correlation of sentences and apply it effectively to a text classification problem. More recently, a CNN-based model [30] was proposed to improve machine translation performance. Both methods have achieved good performance on Natural Language Processing (NLP) problems. Similarly, some research efforts [31,32] use a CNN architecture to capture the mobility patterns of human trajectories. Besides the VANext model above, Wang et al. [31] propose a CNN-based Geo-Conv layer and learn feature maps of mobility trajectories. Compared with RNN-based models, CNNs are capable of capturing a hierarchical representation of the underlying context with lower time complexity. This paper uses a CNN architecture to capture the complex relational information, i.e., the long short-term pattern, of users' historical trajectories.

Preliminaries
In this section, we first present several basic definitions, and then formally formulate the mobility prediction problem. Finally, we discuss the motivation and overview our solution.

Problem Formulation
A context-aware spatiotemporal point $p$ is a tuple of location identification $l$, timestamp $t$ and activity identification $a$, i.e., $p = \langle l, a, t \rangle$. Given a user $u$, $p$ indicates that $u$ visited $l$ for activity $a$ at time $t$. Then, let $T^u = \{p^u_1, p^u_2, p^u_3, \ldots, p^u_n\}$ denote a time-ordered trajectory generated by $u$, where $n$ is the length of $T^u$. For convenience, we omit the superscript $u$. Next, a trajectory $T$ can be segmented as $T = (T_1, T_2, \ldots, T_m)$, meaning that there are $m$ sub-trajectories ordered along the temporal dimension. Specifically, we predefine a threshold $d$ and generate these sub-trajectories such that the time span of each $T_i$ is no more than $d$.
Let $T_h = (T_1, T_2, \ldots, T_m)$ denote the historical trajectory (concatenating all sub-trajectories) of the user $u$, and let $T_c = (p_{n+1}, p_{n+2}, \ldots, p_{n+n-1})$ denote the current trajectory. With these notations, our problem can be formulated as follows: given $T_h$ and $T_c$, our goal is to predict $u$'s next point $p_{n+n}$. Note that we also quantize the time interval of each $p$ into a fixed value as in previous work [16], and $a$ only supplements the location information. Thus, we actually predict the next $l$ in $p_{n+n}$.
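The segmentation rule above (each sub-trajectory spans at most $d$) can be illustrated with a short sketch. It is only one possible reading: the greedy splitting strategy, the `Point` container, the use of seconds as the time unit, and the function name `segment_by_time_span` are all assumptions introduced here, not details given in the text.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    """A context-aware point p = <l, a, t> (hypothetical container)."""
    location: int     # l: location identifier
    activity: int     # a: activity identifier
    timestamp: float  # t: seconds since an arbitrary origin (assumed unit)


def segment_by_time_span(points: List[Point], max_span: float) -> List[List[Point]]:
    """Greedily split a trajectory so that each piece spans at most max_span."""
    segments: List[List[Point]] = []
    for p in sorted(points, key=lambda q: q.timestamp):
        # Open a new sub-trajectory when p lies more than max_span after the
        # first point of the sub-trajectory that is currently being filled.
        if not segments or p.timestamp - segments[-1][0].timestamp > max_span:
            segments.append([p])
        else:
            segments[-1].append(p)
    return segments


DAY = 86400.0
trajectory = [
    Point(1, 0, 0.0),
    Point(2, 1, 3600.0),
    Point(3, 0, 4 * DAY),
    Point(1, 2, 4 * DAY + 60.0),
]
sub_trajectories = segment_by_time_span(trajectory, max_span=3 * DAY)
```

With a three-day threshold, the first two points fall into one sub-trajectory and the last two into another, because the third point lies four days after the first.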

Motivation and Overview
We treat our problem as a time-series problem. An RNN is a natural choice because it is a powerful time-series modeling tool capable of capturing long-range dependencies in sequential information. However, a single RNN is inefficient for predicting human mobility, for two reasons.
First, each user's historical trajectory tends to be long, and an RNN cannot process long sequences directly. Even an LSTM performs poorly on such sequences because of vanishing gradients, as verified in previous work [16]. Second, the sparsity of data cannot be ignored, and it is reflected in two aspects. For one thing, collected data tend to have few features (e.g., coordinates and time), which makes the RNN difficult to train; for another, the time interval between records is usually irregular, which distorts the sequential information of each user. In other words, it induces the problem of fragmented trajectories.
Based on the above observations, this paper proposes a novel neural network-based method to predict human mobility. Specifically, we first segment the entire trajectory of each user and limit the length of each sub-trajectory. Next, we use an embedding layer to represent sparse features (e.g., $l$, $a$ and $t$), which addresses the sparsity problem. We then concatenate the sub-trajectories to generate the historical trajectory. Subsequently, a CNN architecture captures the long short-term mobility pattern of the historical trajectory. With pooling-based methods [33], a CNN architecture can readily represent both local and long-range relationships. Finally, an RNN predicts the future trajectory from the generated pattern and the current trajectory.

Methodology
In this section, we first introduce the basic framework of our prediction model. Subsequently, we detail the important modules of UCG Net. Finally, we interpret the model and introduce its parameter learning. For convenience, the key mathematical notations used in the methodology are shown in Tab. 1. Fig. 1 shows the framework of UCG Net. It mainly consists of four components: trajectory embedding, the historical trajectory learning module, the current trajectory learning module and the classifier.

Basic Framework
1. Trajectory Embedding: Before the training process, we first embed the features from both $T_h$ and $T_c$, which not only represents the semantic relationships among different trajectories but also benefits the follow-up computation.
2. Historical Trajectory Learning Module: This module captures the mobility pattern of $T_h$. We first use a Multi-Layer Perceptron (MLP) to produce the hidden states of $T_h$ and encode them with a convolutional layer, which captures the short-term pattern of the trajectory. Then, we use an FC layer to capture the long-term pattern of the separated hidden states. Finally, we generate the long short-term patterns with a max pooling layer and concatenate them as the input of the current trajectory learning module.
3. Current Trajectory Learning Module: In this module, we use an RNN architecture to capture the sequential transitions of $T_c$. It receives the long short-term patterns and $T_c$ as inputs, and its outputs represent the future trajectory.
4. Classifier: This final component maps the outputs of the current trajectory learning module into a feature representation; a Softmax layer then generates the location probabilities and yields the predicted future trajectory.

Table 1: Key mathematical notations
Notation | Description
$p = \langle l, a, t \rangle$ | The factors of $p$ are the location identification, activity identification and time, respectively.
$T_h$; $T_c$ | The historical trajectory and current trajectory of user $u$.
$d$; $n$ | Predefined threshold of time span and the length of each sub-trajectory.
$m$; $n$ | The number of sub-trajectories and the length of the historical trajectory.
embedded $p$; $h_\tau$ | The former is the representation of the embedded $p$; the latter is the hidden state of the GRU architecture at time $\tau$.
$H_{his}$; $H_{cur}$ | The sizes of the hidden layers in the historical and current trajectory learning modules.
$v_{short}$; $v_{long}$ | The representation vectors of the short-term pattern and the long-term pattern.
$c$; $c^{conv}$; $c^{avg}$ | The first is the hidden states of the historical trajectory; the other two are the convolved $c$ and the result of applying average pooling to $c$, respectively.
$W_*$; $U_*$; $b_*$ | The estimated parameter matrices and bias terms.

Trajectory Embedding
As mentioned, the sparsity of the data's features is a key problem in prediction tasks. Therefore, we collect several related factors (location, activity and timestamp) to describe each record $p$. As in previous work [15,31], an embedding layer [34] can capture the information of the dynamic factors in each $p$. Compared with one-hot encoding, the embedding layer effectively reduces the input dimension and is cheap to compute. Moreover, embeddings help find and share similar patterns among different trajectories. We therefore use an embedding layer to represent the factors of each $p$. As shown in Fig. 2, we learn an embedding for each factor and then concatenate them into a vector represented by $c$. The embedding of each factor is as follows.
Timestamp Embedding. The raw time information of each $p$ cannot be fed to the neural network directly. Following previous work, we map one week into 48 slots (24 slots for weekdays and 24 slots for weekends), so that each hour of a timestamp is an embedding unit. For any $t$ in each $p$, the embedding maps $t$ to a real space (we refer to this space as the embedding space) $\mathbb{R}^{d_t}$ by multiplying with a parameter matrix $W_t \in \mathbb{R}^{48 \times d_t}$.
Location and Activity Embedding. Similarly, the raw location and activity information of each $p$ cannot be fed to the neural network directly. Their identifications $l$ and $a$ are both categorical values, so we use embedding layers to represent them separately. Each categorical value of $l$ is mapped into the embedding space $\mathbb{R}^{d_l}$ by multiplying with a parameter matrix $W_l \in \mathbb{R}^{L \times d_l}$. Similarly, each $a$ is mapped into the embedding space $\mathbb{R}^{d_a}$ by multiplying with a parameter matrix $W_a \in \mathbb{R}^{A \times d_a}$. $L$ and $A$ are the vocabulary sizes of the categorical values $l$ and $a$, respectively, and $d_l$ and $d_a$ are the dimensions of their embedding spaces.
As mentioned above, we concatenate the embeddings of these factors. The final output vector of the embedding layer therefore has dimension $d = d_l + d_a + d_t$.
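The embedding step can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the slot ordering (weekday hours first, then weekend hours), the random initialisation, the toy vocabulary sizes and dimensions, and the helper names `time_slot`, `make_table` and `embed_point` are all hypothetical. Multiplying a one-hot vector by a parameter matrix is equivalent to selecting one row of that matrix, which is what the lookup does.

```python
import random
from datetime import datetime


def time_slot(ts: datetime) -> int:
    """Map a timestamp to one of 48 slots (assumed order: weekday hours, then weekend hours)."""
    is_weekend = ts.weekday() >= 5  # Monday is 0, so Saturday is 5 and Sunday is 6
    return ts.hour + (24 if is_weekend else 0)


def make_table(vocab_size: int, dim: int, rng: random.Random):
    """Random matrix standing in for a learned embedding matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def embed_point(l: int, a: int, ts: datetime, W_l, W_a, W_t):
    """Look up the three embeddings and concatenate them into one vector."""
    return W_l[l] + W_a[a] + W_t[time_slot(ts)]  # '+' concatenates Python lists


rng = random.Random(0)
W_l = make_table(100, 8, rng)  # L = 100 locations, d_l = 8 (toy sizes)
W_a = make_table(10, 4, rng)   # A = 10 activities, d_a = 4
W_t = make_table(48, 3, rng)   # 48 time slots, d_t = 3
x = embed_point(7, 2, datetime(2012, 4, 14, 9, 30), W_l, W_a, W_t)
```

Here `x` has length 8 + 4 + 3 = 15, the sum of the three embedding dimensions.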

Historical Trajectory Learning Module
This module presents the historical trajectory learning method. As mentioned above, $T_h$ has a great impact on human mobility and helps predict the future trajectory. The input of this module is the embedded $T_h \in \mathbb{R}^{n \times d}$. We first use a shared-weight layer (e.g., the MLP) to map each point representation vector into a hidden state.
This layer uses ReLU as the activation function and is parameterized by a matrix $W_H \in \mathbb{R}^{H_{his} \times d}$, where $H_{his}$ is the number of neurons in the layer. We thus obtain all hidden states of $T_h$, expressed as $c = (c_1, c_2, \ldots, c_n)$ with $c \in \mathbb{R}^{n \times H_{his}}$. Based on this operation, we obtain the weighted $T_h$.
Short-term pattern: As mentioned above, we obtain the hidden states $c$ of $T_h$ before the convolution operation. Note that $c$ can be seen as an $H_{his}$-channel input of length $n$. We then use a single-layer 1D convolution to capture the patterns of $c$. The convolution filter is parameterized with a matrix $W_c \in \mathbb{R}^{k \times H_{his}}$, where $k$ is the kernel size of the filter. The filter applies a convolution operation to $c$ along a one-dimensional sliding window, which yields a convolutional feature map $c^{conv}_i$ for each $c_i$: the filter $W_c$ is applied through the convolution operation $*$ to $c_{i:i+k-1}$, the subsequence of $c$ from index $i$ to $i+k-1$.
In practice, we set $k = 2$ because we are mainly interested in the first-order transitions of $T_h$, and thereby obtain the short-term convolutional states of $c$. Furthermore, the number of output channels of the 1D convolution equals the number of input channels, i.e., $H_{his}$. Thus, we obtain all convolutional states $c^{conv} = (c^{conv}_1, c^{conv}_2, \ldots, c^{conv}_{n-1})$ with shape $(n-1) \times H_{his}$.
As Fig. 3 shows, we use max pooling to capture the short-term pattern of the convolutional states $c^{conv}$. We use the maximum value of these convolutional states to represent the short-term pattern, since the maximum effectively expresses the contribution of each $c^{conv}_i$. Specifically, we represent the short-term pattern with a vector $v_{short}$, where each dimension is the maximum of one output channel over $c^{conv}$. Thus, $v_{short} \in \mathbb{R}^{H_{his}}$.

Figure 3: Component of capturing short-term pattern
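The short-term branch described above (a kernel-size-2 convolution with $H_{his}$ input and output channels, followed by a per-channel maximum over the sequence) can be sketched in plain Python. The split of the filter into `W_prev` and `W_next`, the presence of a bias, and the absence of an activation function are assumptions made for this illustration.

```python
def conv1d_k2(c, W_prev, W_next, bias):
    """1D convolution with kernel size 2 over a sequence of hidden states.

    c      : n vectors of size H (the hidden states c_1 .. c_n)
    W_prev : H x H weights applied to c[i]
    W_next : H x H weights applied to c[i + 1]
    bias   : H bias terms (assumed)
    returns: n - 1 vectors of size H, one per adjacent pair of points
    """
    H = len(bias)
    out = []
    for i in range(len(c) - 1):
        row = []
        for o in range(H):
            s = bias[o]
            for j in range(H):
                s += W_prev[o][j] * c[i][j] + W_next[o][j] * c[i + 1][j]
            row.append(s)
        out.append(row)
    return out


def max_over_time(states):
    """Per-channel maximum over the sequence axis (global max pooling)."""
    return [max(channel) for channel in zip(*states)]


c = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]  # n = 3 hidden states, H = 2
identity = [[1, 0], [0, 1]]
c_conv = conv1d_k2(c, identity, identity, bias=[0, 0])
v_short = max_over_time(c_conv)
```

With identity weights each convolutional state is simply the sum of two adjacent hidden states, so a sequence of length 3 yields 2 states, and `v_short` keeps the largest value of each channel.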
Long-term pattern: Capturing only the short-term pattern is incomplete, because it cannot represent the entire mobility pattern of $T_h$, and this would degrade the prediction model. Fig. 4 shows the method of capturing the long-term pattern. We first obtain the hidden states $c$ of $T_h$ from the MLP. $T_h$ is a trajectory sequence of length $n$ with $m$ sub-trajectories, so $c$ contains the hidden states of these $m$ sub-trajectories. Subsequently, we apply average pooling and obtain the representation of the sub-trajectories $c^{avg} \in \mathbb{R}^{m \times H_{his}}$. Note that we fix the length $n$ of each sub-trajectory, and we set both the kernel size and the stride of the average pooling to $n$. Hence, $c^{avg} = (c^{avg}_1, c^{avg}_2, \ldots, c^{avg}_m)$. Next, an FC layer captures the correlation between these sub-trajectories. This FC layer is parameterized with a matrix $W_F \in \mathbb{R}^{H_{his} \times H_{his}}$. Finally, the same max pooling method is used to obtain the long-term pattern $v_{long} \in \mathbb{R}^{H_{his}}$; it is not described again here.
Based on the above two pattern learning methods, we obtain the long-term and short-term patterns. We then concatenate the two mobility patterns to generate the historical long short-term pattern vector $v \in \mathbb{R}^{2 \times H_{his}}$. Finally, $v$ is used in the current trajectory learning module as the initial value of $h$ in Eq. (3).
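The long-term branch and the final concatenation can be sketched in the same style. The sketch assumes that the history length is an exact multiple of the sub-trajectory length, that the FC layer has a bias and no activation, and that it is applied to each pooled vector with shared weights; the names `avg_pool`, `linear` and `long_term_pattern` are hypothetical.

```python
def avg_pool(c, size):
    """Non-overlapping average pooling: kernel size and stride both equal size."""
    if len(c) % size != 0:
        raise ValueError("sequence length must be a multiple of the pooling size")
    pooled = []
    for start in range(0, len(c), size):
        block = c[start:start + size]
        pooled.append([sum(channel) / size for channel in zip(*block)])
    return pooled


def linear(x, W, b):
    """Apply y = W x + b to a single vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def long_term_pattern(c, sub_len, W_F, b_F):
    """Average-pool each sub-trajectory, apply the FC layer, then max-pool per channel."""
    c_avg = avg_pool(c, sub_len)               # m vectors, one per sub-trajectory
    fc = [linear(x, W_F, b_F) for x in c_avg]  # the same weights for every vector
    return [max(channel) for channel in zip(*fc)]


c = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 8.0]]  # 2 sub-trajectories of length 2
W_F = [[1.0, 0.0], [0.0, 1.0]]
v_long = long_term_pattern(c, sub_len=2, W_F=W_F, b_F=[0.0, 0.0])
v_short = [3.0, 3.0]      # placeholder for the short-term pattern vector
v = v_short + v_long      # concatenation gives a vector of size 2 * H_his
```

Because the kernel size equals the stride, each pooled vector summarises exactly one sub-trajectory, and the concatenated `v` has twice the hidden size.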

Current Trajectory Learning Module and Classifier
This module introduces a GRU architecture and uses it to capture the sequential information of $T_c$. For readability, we describe the classifier at the end of this module. As mentioned above, the length of $T_c$ is far less than the length $n$ of the historical trajectory. Moreover, an RNN is well suited to capturing sequential patterns in short inputs because of its chain structure. Users' mobility tends to correlate strongly with their previous footprints, which is similar to a high-order Markov process. In other words, it is important to capture the sequential transitions of $T_c$. We therefore use an RNN architecture for this purpose, specifically the GRU variant proposed in [35]. In this unit, Eq. (3) gives the final output $h_\tau$; $z_\tau$ is the update gate at time $\tau$, which decides how much the unit updates through the sigmoid activation function in Eq. (4). In Eq. (5), the candidate state $\tilde{h}_\tau$ selectively memorizes the current state at time $\tau$ and is computed similarly to a traditional RNN unit. $r_\tau$ denotes the reset gate, which decides how much previous information the unit forgets. $W$ and $U$ are parameter matrices, and $b$ is the bias term. Thanks to this gating mechanism, Eq. (3) keeps historical information and updates current information at each time step. Thus, the GRU architecture can capture the sequential information of time-series data.
In this module, we set up two GRU architectures. We first feed the embedded current trajectory $c(T_c) \in \mathbb{R}^{(n-1) \times d}$ and the pattern vector $v \in \mathbb{R}^{2 \times H_{his}}$ into the GRU architecture. Then, we capture the sequential information through the iteration of $h$ over the temporal dimension in Eq. (3). The hidden state $h$ carries previous information and is passed on as an input to the following iteration. The initial value of $h$ will be introduced in the next section. Finally, the outputs of this architecture are the representation of the future trajectory $o_{T_c} \in \mathbb{R}^{(n-1) \times H_{cur}}$, which is fed into the classifier to generate the predicted trajectory.
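For reference, a commonly used form of the GRU update is sketched below in plain Python. It is a generic textbook formulation, not a transcription of Eqs. (3) to (5): in particular, the convention that $z$ weights the candidate state and $(1 - z)$ weights the previous state is an assumption, as are the parameter names in the dictionary `P`. The initial state `h0` plays the role of the pattern vector $v$ and must have the same size as the GRU hidden state.

```python
import math


def matvec(M, x):
    """Matrix-vector product for nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(W, U, b, x, h, activation):
    """Compute activation(W x + U h + b) element-wise."""
    return [activation(p + q + r) for p, q, r in zip(matvec(W, x), matvec(U, h), b)]


def gru_step(x, h_prev, P):
    """One GRU update; P maps names such as 'W_z' to weights and 'b_z' to biases."""
    z = gate(P["W_z"], P["U_z"], P["b_z"], x, h_prev, sigmoid)  # update gate
    r = gate(P["W_r"], P["U_r"], P["b_r"], x, h_prev, sigmoid)  # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = gate(P["W_h"], P["U_h"], P["b_h"], x, reset_h, math.tanh)  # candidate
    # Assumed convention: z weights the candidate, (1 - z) the previous state.
    return [(1.0 - zi) * hp + zi * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]


def run_gru(xs, h0, P):
    """Iterate the GRU over a sequence and return the hidden state at every step."""
    h, outputs = h0, []
    for x in xs:
        h = gru_step(x, h, P)
        outputs.append(h)
    return outputs
```

With all parameters set to zero, both gates equal 0.5 and the candidate state is zero, so each step halves the previous hidden state, which is a convenient sanity check for the update rule.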
Note that $o_{T_c} \in \mathbb{R}^{(n-1) \times H_{cur}}$ can be expressed as $o_{T_c} = (o_{n+2}, o_{n+3}, \ldots, o_{n+n})$, where $H_{cur}$ is the hidden size of the GRU kernel. To predict the next location of $T_c$, we add an FC layer that maps $o_{T_c}$ into the high-level representation matrix $f$ of the future trajectory, where $f \in \mathbb{R}^{(n-1) \times L}$ is parameterized with a matrix $W_f \in \mathbb{R}^{L \times H_{cur}}$. Then, a Softmax layer generates the probability scores of the future trajectories.

Model Interpretation and Parameter Learning
To make our model clearer, we interpret it here. Because an RNN cannot directly capture the sequential information of the long $T_h$, we replace it with the 1D convolution. The convolutional states $c^{conv}$ contain the transition information of each pair of adjacent trajectory points; each state of $c^{conv}$ represents the transition between two points. Then, we obtain the representation vector of each sub-trajectory with the average pooling layer. These vectors contain the information for each stage of the entire $T_h$, and the FC layer captures the correlation between them. Next, we generate the long short-term pattern of $T_h$ with a shared max pooling method. The resulting long short-term pattern is the initial input of the current trajectory learning module. Finally, we predict users' next locations by learning the mobility pattern of $T_h$ and the sequential transitions of $T_c$.
Based on the prediction model described above, we generate a probability distribution of the predicted trajectory over all locations: the outputs of the GRU kernel are turned into probabilities by the Softmax layer in the final module. To train the model, we use the negative log-likelihood as the loss function. Since the training set contains the trajectories of multiple users, we choose the average loss over all users as our objective function. In this objective, $Z_u$ denotes the training samples of user $u$ and $n$ is the length of each sample; $y_j$ and $l_{j+1}$ are the predicted probability vector and the categorical value of the target location at the $j$-th step. The parameters to be learned include $W_*$, $U_*$ and $b_*$. To avoid overfitting, we add an L2 regularization term with a predefined coefficient. Moreover, we use Adam [36] and the Back-Propagation Through Time algorithm to optimize the objective function.
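The loss described above can be illustrated with a small sketch: a Softmax over location scores, the negative log-likelihood of the target location averaged over steps, and an L2 penalty. The exact averaging over users and samples and the form of the regularizer are assumptions here, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nll_loss(score_rows, targets):
    """Mean negative log-likelihood of the target location at each step."""
    losses = [-math.log(softmax(row)[t]) for row, t in zip(score_rows, targets)]
    return sum(losses) / len(losses)


def l2_penalty(params, coeff):
    """L2 regularization term: coeff times the sum of squared parameters."""
    return coeff * sum(w * w for w in params)


scores = [[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]]  # two steps, three candidate locations
targets = [0, 2]                             # index of the true next location per step
objective = nll_loss(scores, targets) + l2_penalty([0.3, -0.2], coeff=0.01)
```

A uniform score row over three locations contributes exactly log 3 to the loss, which gives a quick check that the Softmax and the logarithm are wired correctly.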

Data Information
We conduct our experiments on publicly available Foursquare datasets collected from two cities, New York (NYC) and Tokyo (TKY). The Foursquare data [37] include long-term (about 10 months) check-in data from 12 April 2012 to 16 February 2013. NYC and TKY are among the most prosperous areas in the world and are the most populous cities in the USA and Japan, respectively. Therefore, it is valuable to study human mobility using the check-in datasets of these two cities. Furthermore, the two datasets have different check-in patterns influenced by cultural differences between the East and the West, which helps verify the robustness of our model. The overall statistics of the two datasets are shown in Tab. 2. Note that we take the categories of POIs (e.g., restaurant, gym, etc.) as activity keywords. For each dataset, we set the time span $d$ of each sub-trajectory to three days and require the time interval between consecutive points $p$ to be no less than ten minutes. Next, we set the length $n$ of each sub-trajectory to 10 and remove users with fewer than 5 sub-trajectories. Finally, we take 80% of each user's data as the training set and the remaining 20% as the testing set for the parameter study.

Evaluation Metrics
To make fair comparisons, we use standard evaluation metrics: Top@k, Recall@k, Precision@k and F1-score@k. Specifically, we present each user with a list of $k$ locations sorted by the score predicted by the classifier, i.e., $S^k_u$. Given a top-$k$ predicted location list $S^k_u$ and the target location list $l^*_u$ of the test set, Top@k is computed over the samples $j$ in the test sets. Recall@k and Precision@k are defined with respect to $S^{visited}_{u,j}$, the list of locations that $u$ has visited in the test set. Note that the above metrics are averaged over all users. Finally, the F1-score is the harmonic mean of Recall and Precision: $F1@k = \frac{2 \cdot Precision@k \cdot Recall@k}{Precision@k + Recall@k}$.
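The metrics can be sketched as follows. Only the harmonic-mean form of the F1-score is taken directly from the text; the hit-rate reading of Top@k and the set-overlap definitions of Precision@k and Recall@k are common conventions assumed here for illustration, and the function names are hypothetical.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring locations, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def top_at_k(score_rows, targets, k):
    """Fraction of samples whose target location is among the top-k predictions."""
    hits = sum(1 for row, t in zip(score_rows, targets) if t in top_k(row, k))
    return hits / len(targets)


def precision_recall_f1(predicted, visited):
    """Set-overlap precision and recall, and their harmonic mean."""
    hits = len(set(predicted) & set(visited))
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(visited) if visited else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


score_rows = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
targets = [2, 1]
top1 = top_at_k(score_rows, targets, k=1)
top2 = top_at_k(score_rows, targets, k=2)
p, r, f1 = precision_recall_f1(predicted=[1, 2], visited=[2, 3, 4])
```

In this toy example neither target is ranked first, but both are within the top two, so the hit rate rises from 0 to 1 when $k$ goes from 1 to 2.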

Comparative Methods
We compare our model with the following five baseline approaches:
1. Markov Chain (MC): We use the strong first-order MC method.
2. RNN: We choose the GRU cell as its basic architecture. The performance of LSTM is close to that of GRU.
3. ST-RNN [14]: ST-RNN is an RNN model that takes spatial-temporal context as input for location prediction.
4. SERM [15]: SERM is an LSTM model that incorporates spatiotemporal context and activity keywords to predict the user's next visit.
5. DeepMove [16]: DeepMove is a strong neural network-based method built on an attention mechanism, which captures historical periodicity from the historical trajectory and complex sequential transitions from the recent trajectory.
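To make the simplest baseline concrete, a first-order Markov chain predictor can be sketched as below. This is a generic count-based version assumed for illustration; how the baseline in the experiments was actually implemented is not specified, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def fit_first_order_mc(trajectories):
    """Count how often each location is followed directly by each other location."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for current, nxt in zip(traj, traj[1:]):
            counts[current][nxt] += 1
    return counts


def predict_next(counts, last_location, k=1):
    """Return up to k most frequent successors of the last visited location."""
    successors = counts.get(last_location, Counter())
    return [loc for loc, _ in successors.most_common(k)]


# Each trajectory is a list of location identifiers in visiting order.
model = fit_first_order_mc([[1, 2, 3, 2, 3], [1, 2, 4]])
prediction = predict_next(model, last_location=2, k=2)
```

The prediction depends only on the last visited location, which is why such a model cannot exploit longer historical context.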

Parameter Settings
In our experiments, the weights of our model are initialized with Xavier initialization. The initial learning rate is set to 1e-3, and the embedding dimensions $d_t$ and $d_a$ are set to 20 and 50, respectively. Because the vocabulary sizes of $t$ and $a$ are much smaller, their embedding dimensions are set to small values. The embedding dimension $d_l$ is set to the optimal hidden size. Note that we set the hidden size of the historical trajectory module $H_{his}$ and the hidden size of the current trajectory module $H_{cur}$ to the same value for convenience of calculation. The setting of the hidden size is discussed in the next section. The detailed settings of our experiments on the two datasets are shown in Tab. 3.

Experimental Results and Discussion
In this section, we compare our model with several strong methods. We first discuss the Top@k results on the two datasets, and then discuss the results of the other evaluation metrics. Moreover, we discuss the choice of experimental parameters.

Performance Comparison
Top@k is the most important of the evaluation metrics because it directly reflects whether the next location of a user is predicted correctly, so we discuss it first. We set $k$ to 1 and 5, since larger values are usually ignored in practice. For the NYC dataset, Tab. 4 shows that our model outperforms all the baseline methods. The Markov method builds its transition matrix from only the last location of each user, so it is not surprising that it performs worst. The RNN model performs better than the Markov model because of its strong sequence modeling ability. ST-RNN and SERM both achieve higher results than the RNN model thanks to contextual features (e.g., spatial and temporal context). However, they cannot handle the complex transition regularities of long trajectories well, so their improvement over the RNN model is not significant. DeepMove achieves the best result among the baseline methods because it exploits historical trajectories with attention mechanisms. Compared with DeepMove, our model shows an increase of 11% and 20% in the Top@1 and Top@5 results, respectively. For the TKY dataset, we observe similar results, so we do not repeat the discussion here; our model shows an increase of 10% and 19% in the Top@1 and Top@5 results, respectively. Moreover, the TKY dataset has more users than the NYC dataset, which makes the model more difficult to train, so all Top@k results on the TKY dataset are lower than those on the NYC dataset.
Since our model is a classifier, we also compare its classification performance with all the baseline methods, using recall, precision and F1-score as additional metrics. As shown in Tabs. 5 and 6, our model achieves the best performance compared with all the baseline methods. For the Recall@k results, our model shows an increase of 12% and 16% in Recall@1 and Recall@5 on the NYC dataset, respectively, and an increase of 11% and 6% in Recall@1 and Recall@5 on the TKY dataset, respectively. The Recall@k results on the two datasets demonstrate that our model is better at identifying the true locations. For the Precision@k results, our model shows an increase of 12% and 31% in
In summary, our model obtains the best performance among all the strong neural network-based methods. The improvements on these evaluation metrics can be ascribed to the following reasons. Firstly, we incorporate more contextual features of whole trajectories, which lets our model learn more knowledge. Secondly, we use CNN-based methods to capture the long short-term pattern effectively. Different from DeepMove, which uses an attention mechanism, our model mainly captures the mobility pattern of sequential sub-trajectories rather than the contribution of separate trajectory points. Thirdly, we use the RNN architecture to capture the sequential transitions of current trajectories together with the mobility pattern of historical trajectories. As a result, our model can capture the complex mobility pattern of whole trajectories effectively.

Validation of Capturing Long Short-Term Patterns
Here, we validate our model's ability to capture long short-term patterns. To this end, we compare three variants: RNN, UCG-Short and UCG. UCG-Short is our model without the long-term pattern component, and it uses the same parameters as UCG. We compare the Top@k results of the three models because Top@k shows prediction performance intuitively; the results are shown in Fig. 5.
On both datasets, UCG Net has the best performance. The comparison of RNN and UCG-Short demonstrates that our model can capture short-term patterns, and the comparison of UCG-Short and UCG demonstrates that it can capture long-term patterns. The full model performs best because it captures long-term patterns in addition to short-term ones.
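The difference between the short-term and long-term stages can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the kernel width, the ReLU and tanh activations, the number of pooling segments, and all variable names are hypothetical, random weights stand in for trained ones, and we assume UCG-Short corresponds to stopping after the convolution.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_same(h, kernel):
    """1D convolution over time with zero padding.

    h: (T, d_in) hidden states; kernel: (w, d_in, d_out). Returns (T, d_out).
    """
    w, d_in, d_out = kernel.shape
    pad_left = (w - 1) // 2
    pad_right = w - 1 - pad_left
    padded = np.pad(h, ((pad_left, pad_right), (0, 0)))
    out = np.empty((h.shape[0], d_out))
    for t in range(h.shape[0]):
        window = padded[t:t + w]  # (w, d_in) local sub-trajectory
        out[t] = np.tensordot(window, kernel, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)  # ReLU, an assumed activation

def average_pool(h, n_segments):
    """Split the time axis into n_segments chunks and average each chunk."""
    chunks = np.array_split(h, n_segments, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])

T, d, w, n_segments = 12, 8, 3, 4          # hypothetical sizes
hidden = rng.normal(size=(T, d))           # stand-in for the shared-weight layer output
kernel = rng.normal(size=(w, d, d)) * 0.1
fc_weight = rng.normal(size=(n_segments * d, d)) * 0.1

short_term = conv1d_same(hidden, kernel)              # (T, d): short-term patterns
pooled = average_pool(short_term, n_segments)         # (n_segments, d): separated hidden states
long_term = np.tanh(pooled.reshape(-1) @ fc_weight)   # (d,): long-term pattern via an FC layer
```

Each row of `short_term` depends only on a window of `w` neighboring hidden states, whereas `long_term` mixes all pooled segments, which is why removing the pooling and FC steps leaves only short-range information.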

Parameter Sensitivity
In this experiment, we investigate the impact of the embedding size and the hidden layer dimension on the two datasets. The embedding sizes of activity and time are relatively small, so their impact is not significant. Therefore, we mainly discuss the choice of the location embedding size. Fig. 6 shows the influence of the embedding size on the two datasets, and Tab. 7 reports the changes in F1-score@k. The change in both metrics is not significant, which indicates the robustness of our model. The best results are obtained with embedding sizes of 400 and 900, so we choose them as the embedding sizes for the two datasets.
With the embedding size fixed at its optimal value, we further discuss the impact of the hidden layer dimension on the two datasets. Fig. 7 shows the influence of the hidden layer dimension, and Tab. 8 reports the changes in the F1-score@k results. The performance of our model improves as the hidden size increases, which indicates that the hidden size influences our model significantly. However, the performance on both datasets degrades when the hidden size is too large, which suggests that overfitting may occur. Based on these observations, we choose 400 and 900 as the hidden sizes of our model on the NYC and TKY datasets, respectively.

Conclusions
In this paper, we propose a neural network-based method to capture the mobility patterns of human trajectories. We incorporate each trajectory point of a user into a context-aware representation with multiple factors, i.e., location, activity and temporal information. The embedding layer then helps capture semantic information among these trajectories. Furthermore, our model captures long short-term patterns using a CNN-based method. Specifically, we capture the short-term pattern via 1 × D convolution and the long-term pattern through the average pooling method. Experiments on two real-world datasets demonstrate the effectiveness of our model. In future work, we will explore deeper versions of our model. The current architecture is relatively simple, which limits high-level representation learning; a deeper architecture may therefore help capture human mobility patterns more precisely.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.