Trip destination prediction based on a deep integration network by fusing multiple features from taxi trajectories

Trip destination prediction plays an important role in exploring urban travel patterns. Accurate prediction can improve the efficiency of traffic management and the quality of location-based services. Here, a deep learning structure that contains three components (travel information extraction, a classification learning mechanism, and an output module) is proposed. Three types of information (the partial trajectory of on-going trips, historical trajectories, and related external information) are extracted in the first component. Then, the classification learning mechanism chooses different methods (i.e. the Long Short-Term Memory network and Embedding technology) according to the characteristics of the variables. Finally, an output layer that integrates prior information about destinations is constructed. Two open-source trajectory datasets are used to validate the effectiveness of the proposed model. Results show that the proposed model outperforms benchmark models that use only part of the information, or that use all of the information but ignore the classification learning mechanism. The performance of the proposed model under different call types and travel durations is further explored. The results of this study will help in understanding travel behaviour in urban areas.


INTRODUCTION
With the concentration of populations in large cities, traffic demand has markedly increased in recent years. This rapid growth also places high pressure on urban transportation systems. The effective management of urban travel demand can optimize travel patterns, relieve traffic congestion, and reduce traffic pollution. To manage travel demand intelligently, it is necessary to explore the regularity of citizens' daily travel behaviour based on large-scale trajectory data. With the widespread application of Global Positioning System-enabled in-vehicle systems and mobile devices, massive trajectory data collected from vehicles on road networks can be used to study urban travel mobility. The accumulation of trajectory data provides the opportunity for many possible applications that rely on vehicle location information, among which travel destination analysis and prediction are critical. The Enabling Advanced Traveler Information Services (EnableATIS) program of the U.S. Federal Highway Administration (FHWA) investigates travel destination prediction because highly accurate prediction of travel destinations benefits transportation management and control: it can help to rationally allocate road network resources and improve traffic control schemes. Accordingly, a Traffic Management Center (TMC) can provide reasonable route suggestions to travellers based on current traffic congestion on a road network. Additionally, the prediction of destinations will also benefit many location-based services, such as personalized navigation, sightseeing recommendations, and mobile advertising. Trip destination prediction aims to estimate the destination of the current trip based on the trajectory prefix and known information related to the current trip. For example, a child goes from home to the bus stop and then takes the bus to school.
The trajectory from home to the bus stop is regarded as the trajectory prefix and the departure time, departure location, and the child's historical travel information are regarded as the known information related to the current trip. We regard the trajectory prefix as partial information and the known information as historical information and external information. Specifically, historical information describes the information of historical trajectories related to on-going trips, and external information includes the date and time, driver characteristics, and land-use information related to on-going trips.
People often travel regularly. For example, the child mentioned above may often leave home for school at the same time on working days. Inspired by this phenomenon, probabilistic methods are widely used in destination prediction. A probabilistic method matches the current trip with historical trajectories based on the ID of the traveller, the departure time, and the departure location, and then determines the destination as the most likely matched location [1][2][3][4][5]. Because probabilistic methods are highly dependent on the existence of historical trajectories related to the on-going trip, this type of method does not work when the on-going trip cannot be matched to any historical trajectory. To overcome this problem, researchers have applied deep learning methods to predict destinations based on the trajectory prefix of the on-going trip [6][7][8][9]. This type of method treats the trajectory as a time series and predicts the destination based on the autocorrelation of the trajectory. Since they do not rely on historical trajectories, deep learning-based methods avoid the problem of data sparsity.
However, when predicting the destination, the existing deep learning methods are only based on partial information [7,8] or partial information and external information [6,9], and historical information is not fully utilized. Moreover, information processing should be further refined instead of simply splicing different features together and inputting them into the prediction model. This problem has not been noted in many existing deep learning-based studies in the field of destination prediction. For example, as a time series, trajectory data should be processed by a specific model (e.g., recurrent neural network) instead of being directly sent to a multi-layer perceptron [6]. It is also inappropriate to use a long short-term memory network, which is suitable for processing sequence data, to manage external information composed of discrete values [9].
In this paper, we propose a neural network-based model called the Classification Learning Neural Network (CLNN) for destination prediction. We first extract the information related to the on-going trip from historical trajectories and use it as the input of CLNN together with the trajectory prefix and external information. Then, a classification learning mechanism is designed to improve the model's ability to extract knowledge from the data. The classification learning mechanism chooses LSTM to model time series (i.e. the trajectory prefix), combines embedding and MLP to model discrete variables (i.e. external information), and uses MLP to model continuous variables (i.e. historical information). Finally, a prediction module that incorporates prior information about the destination is developed to complete the prediction. The primary contributions of this study can be summarized as follows:
• We integrate three types of information (partial, historical, and external) to improve the performance and stability of destination prediction.
• Based on the format and characteristics of variables in the different types of information, a classification learning mechanism is designed to extract hidden features that affect travel behaviour.
• To explore the effect of the temporal distribution of trajectories on travel behaviour and destination prediction, trajectories are divided into three groups based on departure time.
The remainder of this paper is organized as follows. Section 2 summarizes the related work of destination prediction. In Section 3, we introduce the model framework and details of the proposed model. Section 4 presents an analysis of the data and the experimental results. The impact analysis of call type and travel duration on destination prediction is shown in Section 5. Conclusions and future research directions are then summarized in the last section.

LITERATURE REVIEW
Prior studies of destination prediction were primarily based on trip matching, which requires the existence of historical trajectories related to the current trip. Among these studies, the Markov model [1][2][3] and the Bayesian model [4,5] have been widely used. Specifically, Simmons et al. [10] combined GPS data and a map database to capture the driving habits of private drivers and then performed real-time prediction based on a Hidden Markov Model (HMM). Cho et al. [11] developed a location recognition and prediction system based on an HMM in the context of smartphones. All these studies require a map database to link location coordinates to the real road network. Alvarez-Garcia et al. [12] developed an HMM to predict future travel destinations from historical trajectories without a street-map dataset. These methods are usually affected by the problem of data sparsity, which means that in some cases the trajectory to be predicted cannot be matched to any historical trajectory. To solve this problem, the Sub-Trajectory Synthesis (SubSyn) model was proposed to increase the number of query trajectories by nearly 10 times [13,14]. Trajectory similarity analysis can also be used to mitigate data sparsity issues [15][16][17]. Chen et al. [18] classified trajectories into different clusters and predicted trip destinations based on trajectory similarities within the same cluster. Besse et al. [19] clustered historical trajectories and then used the internal trajectories of these clusters to make a final prediction. Terada et al. [20] incorporated trip purpose estimation into destination prediction. Miyashita et al. [21] designed a map-matching algorithm for estimating destinations in an artificial navigation system. Even with various improvements, probabilistic methods still face the challenge of trajectory data sparsity when predicting trip destinations.
Another commonly used method is to predict the destination based on the trajectory prefix. Krumm et al. [22,23] proposed a model called "predestination" to predict the next location of a trip using historical destinations and information on driving behaviour. Ziebart et al. [24] utilized a probabilistic reasoning method to predict destinations based on partial information about driving routes. External information, such as the date and time of trajectories, can be used to improve the accuracy of destination prediction [25,26]. In the 2015 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD) Challenge [27], Brébisson et al. [6] proposed a highly automated model based on multi-layer perceptrons that fused multisource information related to trajectories, including partial trajectory data, spatiotemporal features of trips, and driver characteristics, to make a final prediction. Endo et al. [7] converted a map into a two-dimensional grid, and each location point in the trajectory was represented by a one-hot vector based on its position in the grid; a Recurrent Neural Network (RNN) was then used to predict the destination. Lv et al. [8] converted trajectories into images and then extracted vehicle movement features using a Convolutional Neural Network (CNN) to predict travel destinations. Zhang et al. [28] introduced an image frequency-domain processing method to reduce noise and sparsity and thereby improve prediction. Because there are few meaningful pixels in a trajectory image, this type of approach yields many invalid calculations and high computational demands.
Current research has focused on improving prediction structures with advanced models and integrating different data sources related to travel. Krause et al. [29] combined the trip purpose with the hierarchical Markov model to predict the destination. Rossi et al. [9] combined the Long Short-Term Memory network (LSTM) with attention mechanisms to predict travel destinations. Zhang et al. [30] introduced a Circular Fuzzy Embedding (CFE) technique in the feature representation of trips, and then applied an ensemble learning approach based on Support Vector Regression (SVR) and Deep Belief Network (DBN) to finish destination prediction. However, few studies have fully explored the useful information that is hidden in different data sources. In this paper, we propose a Classification Learning Neural Network to fully utilize the potential of knowledge hidden in the raw data.

Framework
As mentioned earlier, all three types of information are essential to destination prediction. Hence, we propose a model called CLNN to fully extract and integrate all three types of information based on the format and characteristics (e.g. discrete or continuous, time series or non-time series) of information itself. Figure 1 shows the network architecture of the proposed CLNN, which consists of three parts. The first part introduces how to extract three types of information from the original trajectory data, including historical information searching, external information integration, and partial information obtaining. Then, the second part proposes a classification learning mechanism, which uses different technologies (e.g. LSTM, Embedding, and MLP) to deal with different information in a targeted manner. Finally, the third part fuses the prediction results of different modules into the model output.

Historical information searching
In this section, we present an algorithm to extract related historical information. Most drivers have fixed driving routes, for example, commuting between home and the workplace, or taking children to and from school in the morning and evening. Therefore, capturing these similarities among large numbers of historical trajectories can improve prediction accuracy. We design a historical information retrieval algorithm to quickly search for historical information related to on-going trips. This algorithm searches for historical trajectories that are associated with the current trip based on fixed factors. These fixed factors are the driver ID, call type, departure area, and departure time, denoted as f1, f2, f3, and f4, respectively. Because the similarity between destinations is primarily based on the driver's own travel habits, the driver ID (i.e. f1) is given a higher priority.
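As a concrete illustration, the retrieval logic can be sketched as follows. The relaxation order of the factors, the field names, and the candidate limit k are assumptions for illustration, not the authors' exact implementation; only the four fixed factors and the priority of the driver ID come from the text.

```python
def search_history(trip, history, k=10):
    """Search the history library for trajectories matching the on-going trip.

    Matching relaxes the fixed factors step by step: driver ID (f1) is
    always required, while call type (f2), departure area (f3), and
    departure time slot (f4) are dropped one by one until at least k
    candidates are found.
    """
    factor_sets = [
        ("driver_id", "call_type", "dep_area", "dep_time"),  # f1-f4
        ("driver_id", "call_type", "dep_area"),              # f1-f3
        ("driver_id", "call_type"),                          # f1-f2
        ("driver_id",),                                      # f1 only
    ]
    for factors in factor_sets:
        matches = [h for h in history
                   if all(h[f] == trip[f] for f in factors)]
        if len(matches) >= k:
            return matches[:k]
    return matches  # may be shorter than k if the history is sparse
```

The early factor sets demand exact matches on all four factors; each later set widens the search while keeping the driver ID fixed, reflecting its higher priority.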

External information integration
External information includes temporal information, driver characteristics, and POIs. Specifically, temporal information consists of the day of week, week of year, and quarter-hour of day. The driver characteristics consist of the taxi ID and call type. These five features are represented by discrete values, and the POI information is represented by a 10D vector. The traditional method uses one-hot vectors to represent discrete values, which usually encounters two problems: many redundant sparse matrices are generated, and the relationships between the discrete values cannot be estimated. To overcome these limitations, we adopt an Embedding technique to encode discrete values [31]. Each discrete value is considered to be an input word, and another value from its neighbourhood is selected as the output word. The input-output word pairs are then used as data to train a network. The weights of the hidden layer in the trained network are the dense representation of the corresponding discrete values. The embedding technique transforms discrete values into vectors of fixed size while maintaining the relationships between discrete values in the vector space. For the day of week, week of year, quarter-hour of day, taxi ID, and call type, we send each one into an embedding layer to generate a 10D vector. These vectors are fused with the 10D vector representing POI information and fed to a Dense layer to obtain a deep representation of the external information.
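As an illustrative sketch of this step, the following code embeds the five discrete features and fuses them with the POI vector. The randomly initialized lookup tables stand in for trained embeddings, and the Dense parameters W and b (and the tanh activation) are assumptions; only the 10-d embedding size, the five feature cardinalities from the dataset, and the 20-unit Dense output come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 10          # embedding size used in the paper
CARDS = {"day_of_week": 7, "week_of_year": 53,
         "quarter_hour": 96, "taxi_id": 442, "call_type": 3}

# One learnable lookup table per discrete feature (random init here).
tables = {name: rng.normal(size=(card, EMB_DIM)) for name, card in CARDS.items()}

def external_vector(features, poi_vec, W, b):
    """Embed the five discrete features, append the 10-d POI vector,
    and project through a Dense layer to a 20-d representation."""
    embedded = [tables[name][idx] for name, idx in features.items()]
    x = np.concatenate(embedded + [poi_vec])   # 5*10 + 10 = 60-d input
    return np.tanh(W @ x + b)                  # Dense layer with 20 units
```

In a trained network, the tables and the Dense parameters would be learned jointly with the rest of CLNN rather than sampled at random.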

Partial information obtaining
Partial information is obtained by randomly cutting the original complete trajectory. We use the last location in a complete trajectory as the trip destination, and then randomly cut the trajectory without its destination to generate the trajectory prefix. The deep learning network requires a fixed-length input, while the length of the trajectory prefix is inconsistent due to the random cutting process. It is therefore necessary to further process the trajectory prefix to make the length of the partial information consistent. Similar to [6], the first L GPS location points and the last L points of the partial trajectory are selected to form the partial information, which consists of four parts: the longitudes of the first and last L locations, and the latitudes of the first and last L locations. We set the length of the trajectory prefix in the cutting process to be longer than L, and then stack the four parts into a matrix of size 4×L. Because the trajectory prefixes become available after drivers begin their trips, they are considered partial information.

MLP
Multi-Layer Perceptron (MLP) [32] is a feed-forward neural network that consists of an input layer, several hidden layers, and an output layer. In an MLP, each neuron in a layer is connected to all neurons in the next layer, while neurons in the same layer are not connected. The MLP accepts fixed-size vectors as input and processes them through one or several hidden layers, which compute a higher-level representation of the input. Finally, one or more neurons in the output layer return the predicted results. A single neuron weights and sums the output values of all neurons connected to it in the previous layer, and obtains its output through an activation function:

$$y_j^l = f\left(\sum_{i=1}^{n} w_{ij}\, y_i^{l-1} + b_j\right)$$

where $f$ is the activation function, $w_{ij}$ is the weight, $b_j$ is the bias, $y_i^{l-1}$ is the output of the $i$-th neuron in the $(l-1)$-th layer, and $n$ denotes the number of neurons in the $(l-1)$-th layer.
In an MLP, the weights determine the final output. The weights are first assigned initial values and then continuously updated by calculating the gradients of the loss function. Finally, the optimal weights are obtained by minimizing the difference between the predicted and observed values. Specifically, when the MLP contains only one network layer, it is referred to as a Dense layer.
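For concreteness, the computation of a single neuron can be sketched as follows. ReLU is chosen here as an illustrative activation function; the paper does not fix one.

```python
def neuron_output(prev_outputs, weights, bias):
    """Weighted sum of the previous layer's outputs plus a bias,
    passed through an activation function (ReLU in this sketch)."""
    z = sum(w * y for w, y in zip(weights, prev_outputs)) + bias
    return max(0.0, z)   # ReLU activation
```

A full MLP layer simply evaluates this for every neuron, each with its own weight vector and bias.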

LSTM
Unlike a simple feed-forward neural network, the neurons in the same hidden layer of an RNN are interconnected. A single neuron thus receives both the output of the neurons in the previous layer and the output of the neurons in the same layer. Due to gradient vanishing and explosion, ordinary RNNs have difficulty capturing long-term dependence. Accordingly, the LSTM was designed to solve this long-term dependence issue [33]. Improving on the normal RNN, a memory cell is added to each neuron in the hidden layer to provide a path along which the gradient can flow over long spans. In the LSTM, the input gate, forget gate, and output gate are designed to make valid information last longer. The calculation process is as follows:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$h_t = o_t \circ \tanh(c_t)$$

where $x_t$ represents the input vector to the LSTM unit, $h_{t-1}$ is the hidden state passed from the previous time step, $h_t$ is the hidden state sent to the next time step, $f_t$, $i_t$, and $o_t$ are the activation vectors of the forget, input, and output gates, $c_t$ is the cell state vector, $\sigma$ represents the sigmoid function, $\tanh$ represents the hyperbolic tangent function, $W$ and $U$ are weight matrices, $b$ is a bias vector, and the operator $\circ$ denotes the Hadamard product.
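A single LSTM time step implementing the standard gate computations can be sketched as follows; the stacked parameter layout (all four transforms in one matrix) is an implementation convenience, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.  W, U, b stack the parameters of the forget,
    input, output, and candidate transforms (4*H rows in total)."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    f = sigmoid(z[0:H])          # forget gate
    i = sigmoid(z[H:2*H])        # input gate
    o = sigmoid(z[2*H:3*H])      # output gate
    g = np.tanh(z[3*H:4*H])      # candidate cell state
    c = f * c_prev + i * g       # new cell state
    h = o * np.tanh(c)           # new hidden state
    return h, c
```

Iterating this step over the trajectory prefix, with the final hidden state taken as the sequence representation, is how the LSTM branch of CLNN consumes the partial information.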

Fusion
In this experiment, the inputs for destination prediction include partial information, historical information, and external information. Following the setting of [6], the hyperparameter L used to control the length of the partial information during random cutting is set to 5. Therefore, the dimension of the vector R used to represent partial information is 20 (5 × 4). To be consistent with the partial information, the value of K used to control the length of the historical information is set to 10, resulting in a 20-dimensional vector H (10 × 2). The construction of the vector E used to represent external information requires two steps. The discrete values are first sent to the embedding layer to generate the embedded vectors, and then the generated embedded vectors are sent to the Dense layer together with the POI information. Following the embedding setting in [9], the dimension of the embedding vector is set to ten. Then, to keep the dimensions of the different information consistent, the number of hidden units in the Dense layer is set to 20.
Through the above processes, we obtain three vectors of the same size: vector $R$, vector $H$, and vector $E$. These vectors are deep representations of the partial, historical, and external information, respectively. Finally, the three vectors are fused by a simple weighted summation:

$$V = W_R R + W_H H + W_E E$$

where $W_R$, $W_H$, and $W_E$ represent the weights of the three vectors, which are controlled by the contributions of the partial, historical, and external information to the results, respectively. Their optimal values are determined in the model training process described in the following section.

Model output
To improve prediction performance, prior information about the destination is integrated into the proposed model. First, the destinations of all training trajectories are clustered with the mean-shift algorithm [34] to obtain C cluster centres. Then, a linear layer with softmax as the activation function is adopted to introduce the cluster centres into the destination prediction. The number of neurons in the linear layer is set to C, and the coordinates of the cluster centres are used to initialize the weights of the linear layer. The calculation process is as follows:

$$P_i = \frac{\exp(e_i)}{\sum_{j=1}^{C} \exp(e_j)}, \qquad \hat{y} = \sum_{i=1}^{C} P_i\, c_i$$

where $P_i$ represents the output of the softmax function, $c_i$ represents the coordinates of the $i$-th cluster centre, $e_i$ is the output of a single neuron in the previous network layer, and $\hat{y}$ is the final predicted pair of latitude and longitude coordinates.
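This output step can be sketched as follows; for clarity of illustration the cluster centres are passed in directly rather than stored as layer weights, which is an implementation choice of the sketch.

```python
import numpy as np

def predict_destination(e, centres):
    """Softmax over the C logits, then a probability-weighted sum of the
    C cluster centres (each a lon/lat pair) gives the predicted point."""
    p = np.exp(e - e.max())      # numerically stable softmax
    p /= p.sum()
    return p @ centres           # shape (2,): predicted lon/lat
```

Because the prediction is a convex combination of the cluster centres, it always falls inside the convex hull of observed destinations, which is how the prior information constrains the output.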

Data source description
Taxi trajectory data collected in Porto, Portugal [27,28] were used to verify the proposed method in this study. The dataset contains nearly 1.7 million records from 442 taxis, collected from 2013-07-01 to 2014-06-30 (a total of 12 months), as shown in Table 1. Each record is composed of a complete taxi trajectory and associated external information. The trajectory contains a list of GPS locations sampled every 15 s, and the last location represents the trip destination. External information includes the taxi ID, the departure time of the trip, and the call type. The call type identifies the way passengers called taxi services. If the trip was dispatched from the central system, the call type is denoted by 'A'. If a passenger calls taxi services at a specific stand, the call type is denoted by 'B'. The call type is denoted by 'C' when a trip starts randomly on a street.
We integrate Point of Interest (POI) data into the proposed deep learning model. A POI is an entity of interest with a defined location; typical POIs are restaurants, hotels, and tourist sites. The POI data in this study were obtained from Foursquare, which provides ten macro categories to describe location and activity: Arts and Entertainment, College and University, Event, Food, Nightlife Spot, Outdoors and Recreation, Professional and Other Places, Residence, Shop and Service, and Travel and Transport. In this study, we divided Porto into many grids of the same size and then counted the number of POIs of each category in each grid square. Thus, we obtain a 10D vector for each grid, in which each value represents the number of entities belonging to one category. Based on the correspondence between the origin of the trajectory and the grid, we assign a POI vector to each trajectory to describe the level of facilities around the trajectory pick-up point. Details of the trajectory data are shown in Table 2.
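The construction of the 10D POI vector for one grid cell can be sketched as follows; representing each POI as a (category, name) tuple is an assumption of the sketch, while the ten macro categories are those listed above.

```python
from collections import Counter

CATEGORIES = ["Arts and Entertainment", "College and University", "Event",
              "Food", "Nightlife Spot", "Outdoors and Recreation",
              "Professional and Other Places", "Residence",
              "Shop and Service", "Travel and Transport"]

def poi_vector(pois_in_grid):
    """Turn the POIs located in one grid cell into the 10-d count vector,
    one entry per Foursquare macro category."""
    counts = Counter(category for category, _ in pois_in_grid)
    return [counts.get(c, 0) for c in CATEGORIES]
```

Each trajectory is then assigned the vector of the grid cell containing its pick-up point.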
In addition, the public dataset Geolife is used to verify the effectiveness of the proposed model in different cities. Geolife records 18,670 trajectories by 182 users over 5 years. This dataset involves various outdoor activities of users, including daily routines, diversions, and sports.

Evaluation indicator
The evaluation indicator helps compare the performance of different methods. When choosing an indicator, we should consider the data used in the study and the requirements of the primary task. The evaluation indicator used in this study is the Mean Haversine Distance (MHD) error, which is widely used to evaluate the accuracy of destination prediction [29][30][31][32]. The Haversine distance measures the distance between two points on a sphere based on latitude and longitude, which is appropriate for calculating the distance between the predicted and actual destinations. The Haversine distance (km) between two positions can be calculated as follows:

$$d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$$

where $\varphi$ is the latitude, $\lambda$ is the longitude, and $r$ is the radius of the Earth.
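The MHD error averages this distance over all test trips; the distance itself can be computed as follows, with r defaulting to the commonly used mean Earth radius of 6371 km.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2, r=6371.0):
    """Haversine distance in km between two lon/lat points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```

One degree of latitude corresponds to roughly 111 km, which gives a quick sanity check on the implementation.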

Optimization of hyperparameters
The optimization of hyperparameters is an important part of model training and has a significant effect on model performance. The hyperparameters that must be determined in this study include the number of epochs, the learning rate, and the batch size. To accurately estimate the generalization error of the model, the original dataset is divided into four parts: a history library, a training set, a validation set, and a test set. The dataset is divided by randomly extracting 10% of the trajectories as the training set, 1% as the validation set, and 1% as the test set; the remaining trajectories are used as the history library for searching historical information. The set of hyperparameters with the smallest error on the validation set is selected to determine the final model, and the final models are compared on the same test set. The optimal learning rate and batch size are determined based on the error on the validation set. Figure 2 shows the variation of the MHD error on the validation set with epochs under different learning rates and batch sizes. The learning rate and batch size have a significant impact on model performance. When the learning rate is 0.001, the MHD error on the validation set is the lowest, and the model with a batch size of 64 yields the smallest MHD error.
For the third hyperparameter, the optimal number of epochs is selected when the MHD error becomes stable. As shown in Figure 3, the MHD error gradually stabilizes as the number of epochs reaches 40. Thus, the optimal number of epochs is set to 40.

Baselines and time division
In this study, the following models are selected as benchmark methods for comparison:
MLP [6]: A multi-layer perceptron consisting of a hidden layer with 500 ReLU neurons and an output layer that is initialized with cluster coordinates. The network is trained with a stochastic gradient descent algorithm, a batch size of 200, a learning rate of 0.01, and a momentum term of 0.9.
LSTM: LSTM is an improved recurrent neural network that is widely used in speech recognition and machine translation. It is also widely used in traffic flow prediction and traffic speed prediction [35,36].
Lasso: Lasso regression adds an L1 regularization term to ordinary linear regression. It is a simple and practical regression model and is typically used as a baseline in traffic prediction tasks [37,38].
Random Forest: Random Forest (RF) is an ensemble learning method based on decision trees. It has strong generalization ability, is suitable for processing high-dimensional data, and is widely used in many traffic prediction tasks, such as traffic flow prediction [39] and crash duration prediction [40].
Bayesian regression: Bayesian regression is a traditional method in the field of destination prediction [4,5]. It can introduce prior information and express uncertainty in the estimation procedure.
T-CONV: T-CONV is a state-of-the-art approach in the field of destination prediction. It treats trajectories as images and uses a convolutional neural network to achieve precise prediction [8].
SVR: Support Vector Regression is a classic machine learning model that has been successfully applied in various traffic prediction tasks, e.g., traffic flow prediction [41][42][43].
CLNN no-partial, CLNN no-external, and CLNN no-historical: These are variants of the proposed CLNN, corresponding to models with the partial, external, or historical information removed.
For a fair comparison, all models are trained on the same training set, hyperparameters are determined on the same validation set, and prediction errors are estimated on the same test set. In Bayesian regression, only historical information is used as input. The input variables of the MLP do not include historical information or POI information, but it uses a deep information extraction mechanism. Lasso, RF, and LSTM use all three types of information as inputs but do not apply a classification learning mechanism. Traffic states exhibit fluctuations and variations at different times of day. To further consider the influence of temporal factors on travel behaviour and destination prediction, we divide the original dataset into three sub-datasets based on the different travel behaviours at peak hours, off-peak hours, and evening hours, as shown in Table 3.

Experimental constraints
This section summarizes some experimental constraints of this paper, as follows: • When searching for historical information, we assume that the trajectory in the dataset covers all available history records of travellers. • The main purpose of this paper is to predict the destination of private users to benefit some location-based applications, but it is difficult to obtain the trajectory of large-scale private cars. Two data sets are used, one consists of small-scale individual user trajectories, and the other is composed of largescale taxi trajectories. We ignore the impact of taxis as a means of public transportation. • We calculate the POI distribution in each grid based on grid division, and match the departure location of the trajectory with the grid to describe the level of facilities around the trajectory pick-up point. Here, we equate the grid with the vicinity of the departure location. Table 4 shows the MHD error of the models participating in comparison on the test set. The proposed CLNN outperforms the other models for different time periods. More specifically, the MHD error of CLNN is lower than that of Lasso, LSTM, SVR, and RF, which indicates that the classification extraction mechanism is effective. The MHD error of MLP and Bayesian regression is higher than that of CLNN, which shows that the fusion of the three types of information can improve predic- tion accuracy. Also, the MHD error of LSTM is higher than that of the traditional models, such as Random Forest and SVR, which shows that it is unbefitting to simply mixing all types of information and employ an inappropriate model to process. T-CONV considers the spatial correlations between trajectories, and outperforms the traditional machine models. However, due to the lack of historical information and external information, the MHD error of T-CONV is higher than that of CLNN. 
By comparing CLNN and its variants, we find that fusion of the three types of information can help improve the accuracy of destination prediction, and partial information is the most helpful. The MLP considers only two types of information as inputs but produces a smaller MHD error than the RF which considers all three types of information. This result indicates that extracting multiple features with reasonable methods could effectively improve prediction performance compared to simply adding new features in the input layer. By comparing the MHD errors of all models at different time periods, we find that the prediction performance of all models in the evening is inferior to the other two time periods. The reason for this result may be that taxi drivers usually have a fixed travel pattern during the daytime, but tend to wander around the city at night to find passengers. Searching for customers may increase the randomness of driving behaviour and affect prediction performance in the evening. Compared to the other models, CLNN achieves better prediction performance by properly fusing three types of information in the input layer.

Results comparison
In addition, Figure 4 shows the correlation between the predicted and observed destinations of CLNN in terms of longitude and latitude.

RESULTS DISCUSSION
This section explores the effect of two key variables on the prediction results: the call type and the travel duration. These two variables describe travel behaviour from two perspectives: before and after trips. These two factors are investigated because they have strong explanatory power and significant effects on travel behaviour.

Call type
There are three ways for passengers to request taxi services: from the calling centre (Type A), at a specific stand (Type B), and randomly on a street (Type C). To describe the impact of different call types on destination prediction, we first divide the test set into three classes based on the call type and then further subdivide each class by time period, yielding nine sub-datasets. The trained CLNN model is then used to calculate the MHD errors of the nine sub-datasets. The results are shown in Figure 5(a). Because taxis can randomly pick up customers on the road, the randomness of the trajectory pick-up point increases, which degrades prediction accuracy. Thus, the destination of a trajectory identified as type C is much more difficult to predict than those of the other two call types. For type B, there are only 64 taxi stands in Porto, and the randomness of the starting point of travel is relatively low. Thus, the accuracy of destination prediction for taxis called at a specific stand (type B) is the highest among the three call types. The prediction accuracy for trips identified as type A is lower than that for type B. Customers who book taxis from the central system typically have relatively stable travel habits and behaviours.

FIGURE 5 The MHD error for different call types and durations on the test set: (a) call type, (b) duration
Also, the destination prediction error for type A in the evening hours is much higher than that in peak and off-peak hours. This may be caused by the small number of type A samples at night, which leaves insufficient data for model training.
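The call-type and time-period split described above can be sketched with a simple group-by. The column names and error values below are illustrative stand-ins, not drawn from the actual test set; on the full test set, the nine (call type, time period) combinations form the nine sub-datasets.

```python
import pandas as pd

# Illustrative test-set slice; 'call_type', 'period', and 'mhd_error'
# are assumed column names, not from the paper's released data.
test = pd.DataFrame({
    "call_type": ["A", "A", "B", "C", "C", "B"],
    "period":    ["peak", "evening", "peak", "off-peak", "evening", "off-peak"],
    "mhd_error": [1.2, 2.8, 0.9, 3.1, 3.6, 1.0],
})

# Mean MHD error per (call type, time period) sub-dataset.
per_cell = test.groupby(["call_type", "period"])["mhd_error"].mean()
print(per_cell)
```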

Travel duration
Travel duration can be divided into four categories: 0-5 min, 5-15 min, 15-30 min, and more than 30 min. Combined with the three time periods, the test set is thus divided into 12 sub-datasets. Figure 5(b) shows the MHD error on each sub-dataset. As the travel duration increases, the prediction error grows gradually; when the travel duration exceeds 15 min, the error increases sharply. When the duration is less than 15 min, the model produces high prediction accuracy. Moreover, compared with other durations, the prediction error in the evening is much higher for trips lasting longer than 30 min. Since the number of such trips in the evening is very small, the samples available for training are insufficient, which explains this result.
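The duration binning behind these sub-datasets can be sketched as follows; the durations are illustrative, and pandas' default right-closed intervals are assumed for the bin edges.

```python
import pandas as pd

# Illustrative travel durations in minutes.
durations = pd.Series([3, 7, 12, 22, 45, 16])

# Bin edges follow the four categories in the text; intervals are
# right-closed by default, e.g. (0, 5], (5, 15], (15, 30], (30, inf].
bins = [0, 5, 15, 30, float("inf")]
labels = ["0-5 min", "5-15 min", "15-30 min", ">30 min"]
duration_bin = pd.cut(durations, bins=bins, labels=labels)

# Crossing these bins with the three time periods yields the 12 sub-datasets.
print(duration_bin.value_counts().sort_index())
```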

CONCLUSION
In this paper, we developed a deep learning framework for destination prediction by fusing different types of information. Two public trajectory datasets were used to verify the proposed model, and MHD was used as an indicator to evaluate prediction performance in different time periods. In the model comparisons, benchmark methods were divided into four categories: machine learning models with all information simply combined as input variables, an MLP with an appropriate information extraction mechanism using only part of the information, traditional Bayesian regression that used only historical information, and variants of the proposed model.

The contributions of this research can be summarized as follows. First, the proposed CLNN achieves the lowest MHD error over different time periods, demonstrating the effectiveness of the classification learning mechanism and the fusion of three types of information. Second, some internal factors affecting the performance of destination prediction were revealed. For example, for short-duration trips booked from the calling centre or at a specific stand, the proposed model yielded higher predictive accuracy. In contrast, trips that occur at night and last longer than 30 min were difficult to predict.

This study has some limitations that require further investigation. First, different cities have different road network structures, traffic policies, and management measures, and different modes of transportation also exhibit different traffic patterns. Future work should consider fusing different data sources to determine the impact of transportation modes, road network structure, and transportation policies on travel destination prediction, so as to provide more effective strategies for traffic management and control. Second, for trajectories that last longer than 15 min and trajectories generated by taxis wandering around the city, the prediction error was relatively high.
Future research should focus on improving the prediction performance for trips with long travel durations and random origins.