Taxi Demand Prediction Based on a Combination Forecasting Model in Hotspots

Accurate taxi demand prediction can help relieve the congestion caused by the supply-demand imbalance. However, most taxi demand studies are based on historical taxi trajectory data. In this study, we detected hotspots and proposed three methods to predict the taxi demand in hotspots. Next, we compared the predictive effect of the random forest model (RFM), ridge regression model (RRM), and combination forecasting model (CFM). Thereafter, we considered environmental and meteorological factors to predict the taxi demand in hotspots. Finally, the importance of indicators was analyzed, and the essential elements were the time, temperature, and weather factors. The results indicate that the prediction effect of CFM is better than those of RFM and RRM. The experiment obtains the relationship between taxi demand and environment and is helpful for taxi dispatching by considering additional factors, such as temperature and weather.


Introduction
Taxis are an essential part of urban public transportation, and taxi demand differs from other travel demand because of its stochastic trajectories and dependence on spatial location [1,2]. However, the imbalance between the supply and demand of taxis is particularly severe due to the uneven information distribution between drivers and passengers [3]. Taxi drivers' customer-searching behavior relies on historical experience, and passengers' trips are random. The information asymmetry between taxis and passengers wastes limited public resources [4]. Thus, the taxi demand in the hotspots should be predicted [5].
Previous studies on taxi demand prediction are generally based on historical taxi trajectory data and have shown the feasibility of obtaining predictions from such data [1, 5-23]. Methods of traffic demand prediction can be classified into three types: linear system theory (such as the autoregressive moving average model [24], Kalman filtering model, and time series model), nonlinear system theory (such as the neural network model, gray prediction model, and random forest model (RFM)), and combination forecasting model (CFM). The first application of the time series prediction model in traffic prediction research was modeling the univariate traffic flow data as seasonal autoregressive integrated moving average processes [25]. Shekhar used the Kalman filter model to study univariate traffic condition predictions [2]. Alvarez-Garcia et al. proposed a system based on the hidden Markov model to predict taxi trip destinations [26]. Chang et al. mined historical taxi trajectory data and predicted the temporal and spatial distributions of taxi demand [9]. Moreira-Matias et al. introduced a new method for using traffic flow data to predict the spatial distribution of taxi passengers in the short term. A CFM combining three time series prediction methods that can effectively determine the spatiotemporal distribution of taxi passenger demand was proposed [17]. Lv et al. proposed a traffic flow prediction method based on deep learning considering spatiotemporal correlation and used an autoencoder model to learn traffic flow characteristics [27]. Zhang et al. proposed an adaptive prediction method to predict a hotspot location and its heat [22]. Zhao et al. implemented and compared three predictors for determining the maximum predictability: Markov, Lempel-Ziv-Welch, and neural network predictors [13].
Davis used a time series model to predict taxi travel demand based on mobile app taxi services [28]. Zhao et al. proposed a new prediction model based on long short-term memory (LSTM) networks. The proposed LSTM network considered the spatiotemporal correlation in traffic systems [29]. Zhang et al. proposed a D-model based on the hidden Markov chain model for taxi prediction [21]. Yu et al. proposed a spatiotemporal recurrent convolutional network for traffic volume prediction based on the deep convolutional neural network [30]. Ou et al. proposed a method of combining the bias-corrected random forest algorithm with the data-driven feature selection strategy for short-term urban traffic flow prediction to solve the problem of unreasonable feature selection [31]. Yao et al. proposed a deep multiview spatiotemporal network framework to simulate spatiotemporal relationships based on traffic prediction models [32]. Bao et al. considered the interaction between subways and taxis based on univariate traffic prediction and applied the residual neural network to predict the taxi demand in different regions [6]. Ishiguro et al. proposed a taxi demand prediction algorithm using real-time demographic data generated by cellular networks and used a stacked denoising autoencoder to assess the impact of real-time demographic data on taxi demand prediction accuracy [12]. Markou et al. considered the information provided by unstructured data while using taxi GPS data and used machine learning techniques to predict taxi demand [11]. Xu et al. argued that the occurrence of taxi request behavior is related to historical traffic behaviors and proposed an LSTM model, which can predict taxi requests for each region of the city based on historical demand and other relevant information [19]. Past research has mostly focused on pickup points. Rodrigues et al. considered dropoff points and combined the time correlation with the spatial correlation to predict the taxi demand with an LSTM method [18]. Kuang et al. proposed two deep learning methods that combine unstructured textual information with historical taxi trip data for traffic demand prediction research [15]. Furthermore, Castro et al. conducted a review of studies on traffic GPS data and proposed a new direction based on GPS data [33].
Previous works have focused on mining the regularity of trajectory data to predict the traffic demand, but environmental data have been ignored. Furthermore, methods that combine linear and nonlinear system theory have rarely been proposed. This study aims to explore a prediction method combining RFM and RRM for predicting taxi demand in hotspots. Moreover, environmental data are considered. First, the method identifies the taxi demand hotspots in the city. Then, we predict taxi demand at various time periods using the RFM and RRM [34]. Next, we propose a CFM that combines the RFM and RRM. The forecasting method considers environmental and historical taxi trajectory data. This study is beneficial for traffic management in rebalancing taxis. The paper consists of five sections: Section 1 describes the importance of taxi demand prediction and focuses on related research about taxi demand prediction; Section 2 describes the data and method we used in this study; Section 3 describes the results of the experiment; discussion and future research are included in Section 4; and Section 5 describes the conclusion.

Data
2.1. GPS Data. GPS data are from the Xi'an Taxi Management Office and consist of vehicle location data that are recorded every 5 s for 30 days. The dataset consists of 40 million track points. The GPS data have undergone extensive cleaning, and only error-free trip strings are used in this research (Figure 1).

Environmental Data.
The purpose of this study is to accurately predict the demand for taxis in hotspots by constructing a set of influencing factors of the taxi demand. Therefore, the impacts of air quality, weather, wind speed, and temperature on demand for taxis are considered. In this study, the influencing factors of taxi demand are constructed on the basis of two types of data: air quality and meteorological data.
The air quality data are derived from the official website of Green Breathing. The detection indicators include various pollutant data, including PM2.5 and PM10, and the air quality level of the day can be defined according to the AQI. The meteorological data are from the National Meteorological Information Center. This study selects the hourly data of Xi'an, including hourly observations of temperature, pressure, humidity, wind speed, and precipitation. The air quality data used in this study have seven dimensions, and the meteorological data have five dimensions (Table 1).

Random Forest Model.
RFM is an ensemble learning algorithm and an extension of bagging [35]. At each node of each decision tree, a subset of k feature attributes is randomly selected from the feature attribute set of the node; then, the best feature attribute is selected from the subset for division (Figure 2).

Ridge Regression Model.
RRM is a biased estimation method designed for collinear data analysis and is an improved least-squares estimation method. The regression coefficients become more realistic and reliable by abandoning the unbiasedness of the least-squares estimation and losing part of the information. An RRM fits ill-conditioned data more accurately than the least-squares estimation.
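The simplest linear regression model defines the loss function as the square of the residual. Then, the optimization objective is expressed as follows:

$$\min_{\theta} \lVert X\theta - y \rVert_2^2,$$

where $X$ is the feature matrix, $y$ is the vector of observed values, and $\theta$ is the vector of regression coefficients. This objective becomes ill-conditioned when the data have many features, and the number of samples is relatively small. Regularization terms can be used in the aforementioned formula. The $L_2$ norm regularization is introduced into the RRM as follows:

$$\min_{\theta} \lVert X\theta - y \rVert_2^2 + \lVert \Gamma\theta \rVert_2^2.$$

We define $\Gamma = \alpha I$, where $I$ is the identity matrix, and the solution is shown as

$$L(\alpha) = \left(X^{T}X + \alpha^{2} I\right)^{-1} X^{T} y.$$

As $\alpha$ increases, the absolute values of the elements in $L(\alpha)$ tend to decrease, and the deviation from the correct value $\theta_i$ increases. When $\alpha$ tends to infinity, $L(\alpha)$ tends to 0. The trajectory of $L(\alpha)$ that changes with $\alpha$ is called the ridge trace. The value of $\alpha$ at which the ridge trace becomes stable is taken as the optimal value. In general, the $R^2$ value of the ridge regression equation is slightly lower than that of the least-squares fit, but the regression coefficients are usually more significant.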
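As a rough illustration of the shrinkage described above, the following sketch evaluates the ridge estimate in closed form with NumPy on synthetic, nearly collinear data. The function name `ridge_estimate`, the synthetic data, and the grid of α values are illustrative assumptions and are not taken from the study; the penalty is entered as α² so that it agrees with Γ = αI.

```python
import numpy as np

def ridge_estimate(X, y, alpha):
    """Closed-form ridge estimate L(alpha) = (X^T X + alpha^2 I)^(-1) X^T y."""
    n_features = X.shape[1]
    gram = X.T @ X + (alpha ** 2) * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)

# Synthetic data (assumption): the third column almost duplicates the second.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 1] + 0.01 * rng.normal(size=50)
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=50)

# Ridge trace: coefficient vector for an increasing sequence of alpha values.
trace = {a: ridge_estimate(X, y, a) for a in (0.0, 0.1, 1.0, 10.0, 100.0)}
norms = [float(np.linalg.norm(trace[a])) for a in sorted(trace)]
```

Under these assumptions, the entries of `norms` shrink as α grows, which is the behaviour that the ridge trace visualizes.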

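Combination Forecasting Model.
CFM can solve special prediction problems in research by combining the characteristics of different models. The calculation can be expressed as

$$y_{\mathrm{CFM},i} = \lambda_1 y_{\mathrm{RRM},i} + \lambda_2 y_{\mathrm{RFM},i},$$

where $y_{\mathrm{CFM},i}$ is the predicted value of the CFM, $y_{\mathrm{RRM},i}$ is the predicted value of the RRM, $y_{\mathrm{RFM},i}$ is the predicted value of the RFM, and $\lambda_1$ and $\lambda_2$ are the weight coefficients of RRM and RFM, respectively. The core of the CFM is the determination of the weight coefficients $\lambda_1$ and $\lambda_2$. The inverse-variance weighting method is used to determine the weight coefficients of the CFM. The calculation equations are expressed as follows:

$$\lambda_1 = \frac{e_{\mathrm{RRM}}^{-1}}{e_{\mathrm{RRM}}^{-1} + e_{\mathrm{RFM}}^{-1}}, \qquad \lambda_2 = \frac{e_{\mathrm{RFM}}^{-1}}{e_{\mathrm{RRM}}^{-1} + e_{\mathrm{RFM}}^{-1}}.$$

The squared error sum of the RRM is expressed as equation (7), and the squared error sum of the RFM is expressed as equation (8):

$$e_{\mathrm{RRM}} = \sum_{i=1}^{N} \left(y_i - y_{\mathrm{RRM},i}\right)^2, \tag{7}$$

$$e_{\mathrm{RFM}} = \sum_{i=1}^{N} \left(y_i - y_{\mathrm{RFM},i}\right)^2, \tag{8}$$

where $e_{\mathrm{RRM}}$ represents the sum of squared errors of the RRM, $e_{\mathrm{RFM}}$ represents the sum of squared errors of the RFM, $N$ is the number of samples, $y_i$ represents the true value, $y_{\mathrm{RRM},i}$ represents the fitted value of the RRM, and $y_{\mathrm{RFM},i}$ represents the fitted value of the RFM.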
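The weighting scheme can be sketched in a few lines, taking each weight proportional to the reciprocal of the model's sum of squared errors so that the two weights add up to 1. The function name `inverse_variance_weights` and the toy numbers are hypothetical; they only illustrate the arithmetic and are not results from the study.

```python
import numpy as np

def inverse_variance_weights(y_true, pred_rrm, pred_rfm):
    """Return (lambda_1, lambda_2) for RRM and RFM from their squared error sums."""
    e_rrm = np.sum((y_true - pred_rrm) ** 2)  # sum of squared errors of RRM
    e_rfm = np.sum((y_true - pred_rfm) ** 2)  # sum of squared errors of RFM
    inverse = np.array([1.0 / e_rrm, 1.0 / e_rfm])
    lam1, lam2 = inverse / inverse.sum()      # normalize so the weights sum to 1
    return float(lam1), float(lam2)

# Toy values (assumption): RRM misses by 2 everywhere, RFM by at most 1.
y_true = np.array([10.0, 12.0, 15.0, 11.0])
pred_rrm = np.array([12.0, 10.0, 17.0, 13.0])
pred_rfm = np.array([11.0, 12.0, 14.0, 11.0])

lam1, lam2 = inverse_variance_weights(y_true, pred_rrm, pred_rfm)
combined = lam1 * pred_rrm + lam2 * pred_rfm  # CFM prediction
```

In this toy case the squared error sums are 16 and 2, so the less accurate model receives the weight 1/9 and the more accurate one 8/9.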

GPS Data Processing.
The "STAT" attribute in taxi GPS data records the taxi driving state, in which "4" represents carrying a passenger and "5" represents empty driving. A change from "4" to "5" indicates that the passenger exits the vehicle. This record is marked as point D (destination). A change from "5" to "4" indicates that the passenger enters the vehicle. This record is marked as point O (origin).
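A minimal sketch of this state-change rule with pandas is given below. Only the "STAT" codes come from the text; the column names `vehicle_id` and `timestamp`, the integer encoding of the codes, and the function name `extract_od_points` are assumptions made for illustration.

```python
import pandas as pd

def extract_od_points(track):
    """Return the records where the occupancy state flips, labelled O or D."""
    track = track.sort_values(["vehicle_id", "timestamp"]).reset_index(drop=True)
    # Previous state of the same vehicle; NaN for each vehicle's first record.
    previous = track.groupby("vehicle_id")["STAT"].shift(1)
    pickup = (previous == 5) & (track["STAT"] == 4)    # empty -> occupied
    dropoff = (previous == 4) & (track["STAT"] == 5)   # occupied -> empty
    events = track[pickup | dropoff].copy()
    events["event"] = pickup.loc[events.index].map({True: "O", False: "D"})
    return events

# Toy track (assumption): two vehicles sampled every 5 s.
track = pd.DataFrame({
    "vehicle_id": ["A"] * 5 + ["B"] * 4,
    "timestamp": [0, 5, 10, 15, 20, 0, 5, 10, 15],
    "STAT": [5, 5, 4, 4, 5, 4, 5, 5, 4],
})
events = extract_od_points(track)
```

Grouping by vehicle before shifting keeps the last record of one taxi from being compared with the first record of the next.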

Feature Selection.
Ensuring that the features are independent of one another is difficult because of their large number in the experiment. In the modeling process, features with a strong correlation tend to introduce multicollinearity into the data. Therefore, the correlation of the experimental data features should be tested.
The method chosen in this study is the Pearson correlation analysis, which can measure the linear relationship between variables. The calculation is expressed as follows:

$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y},$$

where $\mathrm{cov}(X, Y)$ represents the covariance between the variables $X$ and $Y$, $\sigma_X$ and $\sigma_Y$ represent the standard deviations of the variables $X$ and $Y$, and $\rho_{X,Y}$ represents the correlation coefficient of two continuous variables; the value of $\rho_{X,Y}$ is between −1 and 1. If $\rho_{X,Y} > 0$, then the two variables are positively correlated; if $\rho_{X,Y} < 0$, then the two variables are negatively correlated. A large absolute value of $\rho_{X,Y}$ corresponds to a strong correlation. The corr function of the pandas library in Python is applied to obtain the correlation coefficient matrix (Figure 3). Figure 3 shows that the correlation among PM2.5, PM10, and AQI is strong. A slight multicollinearity is observed between O3 and TEM (temperature), and a correlation also exists between RHU and TEM. Indicators with severe multicollinearity are excluded. Thus, indicators PM2.5 and PM10 are eliminated.
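The screening step can be sketched as follows with `DataFrame.corr` from pandas. The synthetic values, the 0.8 cutoff, and the helper name `highly_correlated_pairs` are assumptions for illustration only; the study does not state a numerical threshold.

```python
import numpy as np
import pandas as pd

# Synthetic indicators (assumption): PM10 and AQI are built from PM2.5.
rng = np.random.default_rng(0)
n = 200
pm25 = rng.normal(60, 20, n)
features = pd.DataFrame({
    "PM2.5": pm25,
    "PM10": 1.4 * pm25 + rng.normal(0, 5, n),
    "AQI": 1.2 * pm25 + rng.normal(0, 5, n),
    "TEM": rng.normal(18, 6, n),
})

# Pearson correlation coefficient matrix.
corr = features.corr(method="pearson")

def highly_correlated_pairs(corr, threshold=0.8):
    """List feature pairs whose absolute correlation exceeds the threshold."""
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if abs(corr.iloc[i, j]) > threshold:
                pairs.append((cols[i], cols[j]))
    return pairs

pairs = highly_correlated_pairs(corr)
```

Pairs returned by the helper are candidates for elimination; which member of a pair to drop remains a modeling decision.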
Four indicator variables of hour, wdy, week, and holiday are also added to explore the impact of time, week, weekday, and holiday factors on the taxi demand (Table 2).

One-Hot Encoding.
The categorical data are encoded using the OneHotEncoder class in the sklearn.preprocessing module of scikit-learn.
The week attribute is taken as an example (Figure 4).
After the one-hot encoding, the data dimension has expanded to 39. In the experiment, the sample size of the dataset is small, and the verification and test sets can be combined when dividing the dataset. The first 23 days of April 2017 are taken as the training set, with the other 7 days as the test set.
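The encoding and the chronological split can be sketched as below. The hourly time index, the column names, and the reading of "week" as the day of the week are assumptions; only `OneHotEncoder` and the 23/7-day split come from the text.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hourly records for April 2017 (assumption: one row per hour).
times = pd.date_range("2017-04-01", periods=30 * 24, freq=pd.Timedelta(hours=1))
frame = pd.DataFrame({"time": times})
frame["week"] = frame["time"].dt.dayofweek  # 0 = Monday, ..., 6 = Sunday

# First 23 days for training, remaining 7 days for testing.
split = pd.Timestamp("2017-04-24")
train = frame[frame["time"] < split]
test = frame[frame["time"] >= split]

# Fit the encoder on the training set only, then reuse it on the test set.
encoder = OneHotEncoder(handle_unknown="ignore")
week_train = encoder.fit_transform(train[["week"]]).toarray()
week_test = encoder.transform(test[["week"]]).toarray()
```

Splitting by date rather than at random keeps the test week strictly after the training period, which matches the forecasting setting.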

Extract Hotspots.
The ArcGIS 10.2 kernel density analysis tool is used to analyze the kernel density of the residents' pickup and drop-off positions in three time periods on working and rest days (Figure 5).
As shown in Figure 5, the taxi demand on weekdays and nonworking days is mainly distributed along the main roads of Xi'an. The taxi demand at the various peak hours is also distributed along the main roads of Xi'an. The areas of intensive taxi demand in Xi'an are regular and show no visible spatiotemporal variation. The heat maps of the 30 days are superimposed (Figure 6).
Hotspots are distributed in areas such as Xi'anbei Railway Station, Bell Tower, Xiaozhai, Railway Station, and City Library. Xi'anbei Railway Station and Railway Station are transportation hubs. Xiaozhai, City Library, and Bell Tower are commercial areas. In this study, two representative areas, namely, Bell Tower and Xi'anbei Railway Station, are selected (Figure 7).
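For readers without ArcGIS, the idea behind the kernel density surface can be approximated with a short NumPy sketch. This is not the ArcGIS tool: it swaps in a plain Gaussian kernel on a regular grid, and the function name `kernel_density_grid`, the bandwidth, and the synthetic pickup coordinates are all assumptions.

```python
import numpy as np

def kernel_density_grid(points, grid_x, grid_y, bandwidth):
    """Gaussian kernel density of 2-D points evaluated on a regular grid."""
    points = np.asarray(points, dtype=float)
    gx, gy = np.meshgrid(grid_x, grid_y)                # shapes (ny, nx)
    cells = np.stack([gx.ravel(), gy.ravel()], axis=1)  # (ny * nx, 2)
    # Squared distance between every grid cell and every point.
    d2 = ((cells[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    density = np.exp(-d2 / (2.0 * bandwidth ** 2)).sum(axis=1)
    density /= 2.0 * np.pi * bandwidth ** 2 * len(points)
    return density.reshape(gx.shape)

# Synthetic pickups (assumption) clustered around the location (2, 3).
rng = np.random.default_rng(0)
pickups = rng.normal(loc=[2.0, 3.0], scale=0.3, size=(500, 2))

grid_x = np.linspace(0.0, 5.0, 51)
grid_y = np.linspace(0.0, 5.0, 51)
density = kernel_density_grid(pickups, grid_x, grid_y, bandwidth=0.25)

# The grid cell with the highest density is the candidate hotspot.
iy, ix = np.unravel_index(density.argmax(), density.shape)
hotspot = (float(grid_x[ix]), float(grid_y[iy]))
```

Summing such surfaces over several days would correspond loosely to the superimposed heat map; real coordinates would first need to be projected to a metric system.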

Random Forest Prediction.
Using Python's sklearn.ensemble library, we can use random forest regression (RFM) (Table 3). The main influencing parameter of RFM is "n_estimators." We use the goodness of fit ($R^2$) to adjust the parameters of RFM. The calculation is expressed as follows:

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST},$$

$$SST = \sum_{i=1}^{N} \left(y_i - \bar{y}\right)^2, \qquad SSR = \sum_{i=1}^{N} \left(\hat{y}_i - \bar{y}\right)^2, \qquad SSE = \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2,$$

where $N$ is the sample size, $SST$ is the total sum of squares, $SSR$ is the sum of squares of regression, $SSE$ is the sum of squared residuals, $y_i$ is the value to be fitted, $\bar{y}$ is the mean of $y$, and $\hat{y}_i$ is the fitted value. Considering the number of samples and the training speed of RFM, we choose [1, 200] as the search range of "n_estimators." The relation between "n_estimators" and $R^2$ can be calculated (Figures 8 and 9). The adjusted optimal parameters for the Xi'anbei Railway Station and Bell Tower areas are shown in Tables 4 and 5.
RFM can score the importance of feature attributes. In the RFM, evaluating the importance of feature attributes is based on the principle of random permutation. The reduction in the mean square residual and the prediction accuracy reflects the importance of characteristic variables. In this study, the calculation of the mean square residual reduction is used to evaluate the importance of the variables:
(1) We assume M regression trees in the random forest. $OOB_i$ represents the out-of-bag data of the ith tree. The out-of-bag mean square deviations of the trees are $MSE_{OOB_1}, MSE_{OOB_2}, \ldots, MSE_{OOB_M}$.
(2) We assume that the total number of variables is N. For each input variable $X_i$, random permutation is conducted in the M out-of-bag data. M new out-of-bag data are obtained, and the mean square deviation of the new out-of-bag data is calculated. Then, an out-of-bag error matrix can be constructed as follows:

$$\begin{bmatrix} MSE_{11} & MSE_{12} & \cdots & MSE_{1M} \\ MSE_{21} & MSE_{22} & \cdots & MSE_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ MSE_{N1} & MSE_{N2} & \cdots & MSE_{NM} \end{bmatrix}$$

(3) The out-of-bag errors $MSE_{OOB_1}, MSE_{OOB_2}, \ldots, MSE_{OOB_M}$ before permutation are subtracted from the ith row of the out-of-bag error matrix. Then, the significance score of $X_i$ is the average of the abovementioned calculated results, as shown in the following equation:

$$VIM_i = \frac{1}{M} \sum_{j=1}^{M} \left(MSE_{ij} - MSE_{OOB_j}\right).$$

A large value of $VIM_i$ corresponds to a great contribution of the variable.
This study uses the feature_importances_ attribute of the RFM in the scikit-learn library to score the input variables (Figures 12 and 13).
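A minimal sketch of the tuning loop and the importance read-out is shown below, using `RandomForestRegressor` and `r2_score` from scikit-learn. The synthetic data, the coarse grid of "n_estimators" values, and the fixed `random_state` are assumptions; they stand in for the full [1, 200] sweep on the real features. Note that `feature_importances_` in scikit-learn is impurity-based, whereas a permutation-style score like the one in steps (1) to (3) is available separately as `sklearn.inspection.permutation_importance`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic data (assumption): feature 0 dominates the target.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(400, 4))
y = 10.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0.0, 0.5, 400)
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Sweep n_estimators and record the goodness of fit on held-out data.
scores = {}
for n_trees in (1, 10, 50, 100, 200):
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    model.fit(X_train, y_train)
    scores[n_trees] = r2_score(y_test, model.predict(X_test))

best_n = max(scores, key=scores.get)

# Refit with the best setting and read the feature importance scores.
best_model = RandomForestRegressor(n_estimators=best_n, random_state=0)
best_model.fit(X_train, y_train)
importances = best_model.feature_importances_
```

Plotting `scores` against the number of trees gives a curve of the same kind as the relation between "n_estimators" and R² described above.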

Ridge Regression Prediction.
Using Python's sklearn.linear_model library, we can find the implementation of ridge regression prediction models (Table 6). The two most essential parameters in the RRM are the regularization intensity (alpha) and computational solver (solver) (Table 7).
After the RRM with the optimal parameters is constructed, the prediction results are shown in Figures 14 and 15.
After the training of the RRM, the fitted model can be output. The standardization process is performed in advance. Thus, the model has no intercept term, and the magnitude of each index coefficient represents the importance of the index (Figures 16 and 17).
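The effect of standardizing before fitting can be sketched with `Ridge` and `StandardScaler` from scikit-learn. The synthetic features, the choice `alpha=1.0`, and the decision to standardize the target as well are assumptions made so that the intercept vanishes; the tuned values of the study are those reported in its tables.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic features (assumption) on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3)) * np.array([1.0, 10.0, 100.0])
y = 3.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0.0, 0.5, 300)

# Standardize features and target so that every variable has zero mean
# and unit variance before fitting.
X_std = StandardScaler().fit_transform(X)
y_std = (y - y.mean()) / y.std()

model = Ridge(alpha=1.0, solver="auto").fit(X_std, y_std)

# With standardized inputs, coefficient magnitudes are comparable
# and can be ranked as a rough importance measure.
ranking = np.argsort(-np.abs(model.coef_))
```

Without the scaling step, the raw coefficients would mostly reflect the units of each feature rather than its influence.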

Combination Forecasting Model.
The weight coefficients of the two models in the CFM can be obtained from the sums of squared residuals of RFM and RRM on the training set. The weight coefficients of RFM and RRM are 0.793067 and 0.206933, respectively. The prediction results are shown in Figures 18 and 19.
We use mean square error, mean absolute error, and goodness of fit ($R^2$) to test the prediction effect of the three models (Tables 8 and 9). Figures 10, 11, 14, 15, 18, and 19 show the prediction results of taxi demand in the Xi'anbei Railway Station and Bell Tower areas by RFM, RRM, and CFM. Tables 8 and 9 compare the forecast effect of the three forecasting methods. The tables indicate that CFM has the highest accuracy among the three models.
As shown in Figures 12 and 13, the most crucial factor in taxi demand at the Xi'anbei Railway Station is the hour because the station is a transport hub. This finding illustrates that taxi demand in a transport hub has a strong correlation with the time factor. Figures 12 and 13 also show that O3 is the main factor in the Bell Tower area. Ozone concentration is related to temperature, and hot weather increases the taxi demand in the commercial area. However, Figures 16 and 17 imply that the main factors of RRM in the two areas are the time factor and O3. The differences between the two areas are smaller for RRM than for RFM.

Conclusions
In this study, we investigated the taxi demand prediction in hotspots and then proposed three prediction models, namely, RFM, RRM, and CFM. We extracted hotspots of taxi demand, and the taxi demand prediction model was constructed on the basis of taxi demand hotspots. The proposed models combined time, meteorological, and environmental characteristics to explain the generation of taxi demand. The prediction results show that CFM has better robustness and smaller error than RFM and RRM in the Xi'anbei Railway Station area and the Bell Tower area. The experiment also indicates that taxi demand prediction is mainly affected by the time period in the Xi'anbei Railway Station area. In the Bell Tower area, the importance of ozone concentration and temperature to the model is relatively high. The study concludes that the proposed model can improve prediction accuracy. The most important influencing factor of the taxi demand prediction model is the time factor. Temperature and weather indicators are also relatively important.
Some limitations in the research on taxi demand prediction still need to be addressed. For example, the impact of other similar types of traffic demand is ignored in this study. If travel demand can be met by an online car-hailing service, then taxi demand will be greatly reduced.
This study also ignores the impact of land use properties on taxi demand, which will be one of our future research directions. Some environmental features are difficult to obtain. Thus, we will propose a method to predict environmental features for predicting taxi demand more precisely in the future.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.