Predicting subway passenger flows under different traffic conditions

Passenger flow prediction is important for the operation, management, efficiency, and reliability of urban rail transit (subway) systems. Here, we employ large-scale subway smartcard data from Shenzhen, a major city in China, to predict dynamic passenger flows in the subway network. Four classical predictive models were analyzed: the historical average model, the multilayer perceptron neural network model, the support vector regression model, and the gradient boosted regression trees model. Ordinary and anomalous traffic conditions were identified for each subway station using the density-based spatial clustering of applications with noise (DBSCAN) algorithm. The prediction accuracy of each predictive model was analyzed under ordinary and anomalous traffic conditions to explore the high-performance condition (ordinary or anomalous) of each model. In addition, we studied how long in advance passenger flows can be accurately predicted by each predictive model. Our findings highlight the importance of selecting proper models to improve the accuracy of passenger flow prediction, and show that the inherent patterns of passenger flows influence prediction accuracy even more prominently.


Introduction
Public transportation plays an indispensable role in modern big cities. Developing public transportation is regarded as the most effective way to solve the ubiquitous traffic congestion problems [1,2]. The subway is regarded as the backbone of urban public transportation, and is characterized by high speed, convenience, and mass-flow capacity [3][4][5][6][7]. Although subway services have been continuously improved in many big cities, the upgraded supply usually cannot meet the even faster growing demands of human mobility, especially in developing countries. Compared with opening new lines or increasing the operating frequency of trains, intelligent operation is a smarter and more cost-efficient way to improve the level of service. This calls for accurate and robust prediction of passenger flows to guide better use of the capacity of subway networks. Although some passenger flow prediction models have been proposed, we revisit this important problem from two new perspectives.
First, we analyzed the performance of different predictive models under different passenger flow (traffic) conditions. In general, traffic conditions can be classified into ordinary and anomalous conditions. Earlier studies applied, for example, the geographically and temporally weighted regression (GTWR) model to identify the spatiotemporal influence of the built environment on transit ridership. Traffic simulation models were widely used as computers became common in scientific research. In 2001, Chrobok et al. [21] presented an approach based on a micro-simulator to predict traffic flow in the freeway network of North Rhine-Westphalia. In 2010, McCrea et al. [22] proposed a novel hybrid approach that combines the advantages of traffic simulation models and linear system theory. In their model, traffic dynamics were first simulated using a continuum mathematical model to obtain relevant traffic parameters of road segments, and the obtained parameters were used as inputs to a Bayesian model for traffic flow prediction. Under the same requirement of prediction accuracy, the hybrid approach improved computing efficiency compared to the Bayesian network model.
In recent years, knowledge discovery methods have been used more frequently in traffic prediction. Representative methods include nonparametric regression analysis, artificial neural networks, support vector machines, wavelet analysis, and gradient boosting decision trees [23]. In 1991, Davis and Nihan applied nonparametric regression to predict traffic flow on a freeway; however, the accuracy of prediction was lower than that of the linear time-series method [24]. Twelve years later, Clark applied multivariate nonparametric regression to predict the traffic state of a motorway [25]. The method was simple and easy to implement, required only modest data storage, and produced reasonably accurate short-term forecasts of traffic flow and loop occupancies (the percentage of time a loop detector is covered by a vehicle).
Artificial neural networks were born in the 1940s and were first introduced to traffic flow prediction by Vythoulkas in 1993 [26], who employed an artificial neural network to predict the traffic state of a city road network. Two years later, Dougherty summarized the applications of neural networks in transportation studies [27]. The transportation research community saw an explosion of interest in neural networks in the 1990s, and a variety of neural network models have been proposed to predict traffic conditions. Representative examples include the multilayer perceptron neural network model [28], the radial basis function neural network [29,30], the spectral basis artificial neural network [31], the time delayed neural network [32], and the recurrent neural network [33]. Models combining neural networks with other techniques (e.g., time series [34], genetic algorithms [35], fuzzy logic rules [36], empirical mode decomposition [37], etc.) have also been studied. Support vector machines were formally published in 1995 [38], and studies on support vector regression (SVR) began in 1997 [39]. Support vector regression has been used for travel-time prediction [40,41]. Wu et al. [40] validated the feasibility of applying support vector regression to travel-time prediction; the mean relative errors for traveling different distances were less than 5% in the test dataset. Vanajakshi et al. [41] found that support vector regression performs better than artificial neural networks when training data are scarce or highly variable. Recently, Jiang et al. [42] combined ensemble empirical mode decomposition with a gray support vector machine to predict the short-term passenger flow of high-speed rail (HSR); the mean absolute percentage error of the hybrid model is about 6%, outperforming the SVM model and the ARIMA model.
Wavelet analysis, developed in the 1980s, is usually used to decompose original traffic flow signals into components at different time scales, so as to reflect and distinguish the internal variation trend and the stochastic disturbance of traffic flows. He et al. [43] proposed a method based on wavelet decomposition and reconstruction combined with a time series model for traffic volume prediction. The processed signals with different characteristics can also be combined with dynamic neural networks [44], support vector machines [45], and other methods to predict traffic flow.
In this study, the smartcard data of more than 6 million subway passengers and geographic information data of the Shenzhen subway network were used. We analyzed four classical predictive models: the historical average (HA) model, the multilayer perceptron (MLP) neural network model, the support vector regression (SVR) model, and the gradient boosted regression trees (GBRT) model. Unlike previous studies, we explored the high-performance models under different traffic conditions, and studied how long in advance passenger flows could be accurately predicted by each predictive model.
The paper is organized as follows. Section II describes the geographic information data and passenger mobility data used in this study. Section III introduces the passenger flow prediction models and the algorithm used to classify passenger flow (traffic) conditions. Section IV analyzes and discusses the passenger flow prediction results of the different models, and identifies the high-performance models under different traffic conditions and different model implementation conditions (how long passenger flows are predicted in advance). Section V concludes and discusses future research directions.

Data
The geographic information systems (GIS) data and the smartcard data of Shenzhen subway passengers were both provided by the Shenzhen Transportation Authority. Data collection was conducted in 2014; smartcard data were collected from October 1, 2014 to December 31, 2014. In 2014, the subway network consisted of 118 subway stations; stations opened after 2014 were not considered due to the lack of smartcard data for them. Whenever a subway passenger uses his/her smartcard to enter or exit a subway station, the time, card ID, and subway station ID are recorded. During the three-month data collection period, a total of 262 million passenger records were generated. On some days, data were missing for a few hours or for the whole day; therefore, only days with complete records were used in this study (80 days in total).
The three-month observation period was split into 7,680 time windows, each spanning 15 min. Taking the operation period of the Shenzhen Metro into consideration, the time period of data collection for each day was from 7:00 a.m. to 10:30 p.m.; therefore, only 62 time windows in each day were used for training and testing data. The time windows from 10:30 p.m. to 7:00 a.m. were not considered because few smartcard records are available during the late-night period. We calculated the number of passengers entering a subway station s during each time window t, the in-passenger-flow N_in(s,t), and the number of passengers exiting a subway station s during each time window t, the out-passenger-flow N_out(s,t) (Fig 1A and 1B). A heterogeneous distribution of passenger flows is observed in the studied subway network (Fig 2A and 2B). The in-passenger-flow can be approximated by two different fitting functions for small and large N_in(s,t) (gray dashed lines are plotted to guide the eyes): fit1: P(N_in(s,t)) = 0.017 (N_in(s,t))^(−0.304) when N_in(s,t) ≤ 150 persons; fit2: P(N_in(s,t)) = 0.009 exp(−0.006 N_in(s,t)) when N_in(s,t) > 150 persons.
The out-passenger-flow can also be approximated by two different fitting functions for small and large N_out(s,t) (gray dashed lines are plotted to guide the eyes): fit3: P(N_out(s,t)) = 0.017 (N_out(s,t))^(−0.384) when N_out(s,t) ≤ 150 persons; fit4: P(N_out(s,t)) = 0.005 exp(−0.004 N_out(s,t)) when N_out(s,t) > 150 persons.
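For reference, the two-regime fit for the out-passenger-flow distribution can be expressed as a single piecewise function; the coefficients are copied from fit3 and fit4 above, and the function is merely a transcription of those fits, not part of the study's code.

```python
import math

# Piecewise approximation of the out-passenger-flow distribution P(N_out),
# using the fitted coefficients quoted above (fit3 and fit4).
def p_out(n):
    if n <= 150:
        return 0.017 * n ** (-0.384)          # power-law regime for small flows
    return 0.005 * math.exp(-0.004 * n)       # exponential tail for large flows

print(round(p_out(100), 5))  # power-law branch
print(round(p_out(500), 5))  # exponential branch
```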
Roughly 58.47% of the in-passenger-flows N_in(s,t) and 50% of the out-passenger-flows N_out(s,t) were smaller than 200 passengers/15 min; for some stations, passenger flows were larger than 1,000 passengers/15 min. In the following sections, the measured in-passenger-flows N_in(s,t) and out-passenger-flows N_out(s,t) were used as the ground truth data to train the passenger flow prediction models and to validate the predictive results.
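The per-window counts N_in(s,t) and N_out(s,t) can be derived from raw transactions along the following lines. The record layout (timestamp, card ID, station ID, entry/exit flag) is a hypothetical stand-in for the actual Shenzhen smartcard schema, which is not specified here.

```python
from collections import Counter
from datetime import datetime

# Hypothetical record format: (timestamp, card_id, station_id, event), where
# event is "in" (entry) or "out" (exit). Field names are illustrative only.
records = [
    ("2014-10-08 07:03:12", "c001", "S01", "in"),
    ("2014-10-08 07:10:45", "c002", "S01", "in"),
    ("2014-10-08 07:20:02", "c001", "S02", "out"),
]

def window_index(ts, day_start_hour=7):
    """Map a timestamp to a 15-min window index; window 0 starts at 7:00 a.m."""
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    minutes = (t.hour - day_start_hour) * 60 + t.minute
    return minutes // 15

n_in, n_out = Counter(), Counter()
for ts, card, station, event in records:
    key = (station, window_index(ts))
    (n_in if event == "in" else n_out)[key] += 1

print(n_in[("S01", 0)])   # two entries at S01 during 7:00-7:15
```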
The subway smartcard data were split into two parts. The first part, which recorded the subway passenger trips generated during October and November of 2014, was used as the training dataset. The second part, which recorded the trips generated during December of 2014, was used as the testing dataset. The training dataset is denoted by D = {(x_1,y_1),(x_2,y_2),...,(x_n,y_n)}, where x_n ∈ R^d represent the input features of the training data, and y_n ∈ R^l represent the output results of the training data. The sample size n equals 59 because there were 59 days' smartcard data in the training dataset. Dimensions d and l represent the number of input and output features used in the models, respectively.

The prediction models

When predicting the passenger flow of a subway station s during a time window t_target, subway station s is called the target station, and time window t_target is called the target time window. We evaluated the performance of four predictive models under different model implementation conditions: predictions were made a different number (n_step) of time windows before t_target, and n_step = 1, 2, ..., 8 were tested. Here, we briefly introduce the advantages and disadvantages of each of the four predictive models used in this study. The HA model is easy to implement in practice, but performs poorly under unexpected traffic conditions. The multilayer perceptron (MLP) neural network employed in this study is trained using backpropagation. In general, the MLP model works well in capturing complex and nonlinear relations; however, it usually requires a large volume of data and complex training procedures. For the employed SVR model, a linear kernel function was used to predict passenger flows; however, the selection of the best kernel function remains an open problem in this scientific community. Lastly, the GBRT model uses the negative gradient of the loss function as an estimate of the residuals.
In general, the GBRT model also works well in exploring complex and nonlinear relations; however, its training cannot be parallelized.
In the generated HA model, the average in-passenger-flow (or average out-passenger-flow) during the target time window t_target over all days in the training dataset was used as the predictive result for the target time window on all days in the testing dataset. Clearly, the HA model was unable to capture random disturbances of passenger flows, and therefore had the worst prediction accuracy; it served as a baseline model for comparison with the other three models. For the MLP, SVR, and GBRT models, the in-passenger-flows N_in(s,t) during time window t of all days in the training dataset were used as inputs, and the in-passenger-flows N_in(s,t_target) during the target time window of all days in the training dataset were used as outputs to train the predictive model; t is n_step time windows before the target time window t_target. For a given day of the testing dataset, the in-passenger-flow N_in(s,t) was used as input to predict N_in(s,t_target), where t is n_step time windows before the target time window t_target. Parameter n_step determines how long in advance predictions are conducted. Similarly, models were generated to predict N_out(s,t_target). Methods for generating the MLP, SVR, and GBRT models are briefly described in the following subsections; please refer to the literature [46][47][48][49] for further details on the generation of these models. The training dataset D = {(x_1,y_1),(x_2,y_2),...,(x_n,y_n)}, x_n ∈ R^d, y_n ∈ R^l was used in the MLP, SVR, and GBRT models. Parameters d and l represent the dimensions of x and y, respectively. In this paper, d = 1 and l = 1 because only the passenger flows of a station itself are used as model inputs to predict the passenger flows of that station. Parameter n represents the sample size of D (i.e., 59 days' smartcard data in the training dataset).
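The HA baseline just described reduces to averaging the flow of the target window over all training days; a minimal sketch, with an illustrative data layout (a dict of per-day flow lists, not the study's data structures):

```python
# Minimal sketch of the HA baseline: the prediction for a target window is the
# mean flow observed in that same window over all training days.
def historical_average(training_flows, t_target):
    """training_flows: {day: [flow per window]}; returns HA prediction for t_target."""
    values = [flows[t_target] for flows in training_flows.values()]
    return sum(values) / len(values)

train = {"day1": [100, 220, 180], "day2": [120, 260, 200], "day3": [110, 240, 190]}
print(historical_average(train, 1))  # (220 + 260 + 240) / 3 = 240.0
```

Note that the prediction is the same regardless of n_step, which is why the HA model is insensitive to how far in advance the prediction is made.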
Taking the prediction of the out-passenger-flow N_out(s,t_target) at the subway station "Window of World" during 9:00 a.m.-9:15 a.m. of December 30 as an example, s denotes the "Window of World" subway station, and t_target denotes the target time window 9:00 a.m.-9:15 a.m. When predicting passenger flows one time window ahead of the target time window (n_step = 1), the historical passenger flows at station s during the time window t_target − 1 of all days in the training dataset D = {(x_1,y_1),(x_2,y_2),...,(x_n,y_n)} are used. In this example, x_n represents the out-passenger-flow at the "Window of World" station during the time window 8:45 a.m.-9:00 a.m. of the nth day in the training dataset, and y_n represents the out-passenger-flow of the "Window of World" station during the time window 9:00 a.m.-9:15 a.m. of the nth day in the training dataset.
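The assembly of training pairs for a given n_step can be sketched as follows; the flow values are invented for illustration and the daily-list layout is an assumption, not the study's format.

```python
# Sketch of assembling the training set D = {(x_k, y_k)} for a given n_step:
# x_k is the flow at window t_target - n_step on day k, and y_k is the flow
# at t_target on the same day.
def build_training_pairs(daily_flows, t_target, n_step):
    return [(flows[t_target - n_step], flows[t_target]) for flows in daily_flows]

daily_flows = [
    [50, 80, 130, 210],   # day 1: flows for windows 0..3
    [55, 90, 140, 230],   # day 2
]
pairs = build_training_pairs(daily_flows, t_target=3, n_step=1)
print(pairs)  # [(130, 210), (140, 230)]
```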
Multilayer perceptron neural network model. The multilayer perceptron is a feedforward artificial neural network that maps a set of input vectors to a set of output vectors. An MLP consists of multiple layers: an input layer, one or more hidden layers, and an output layer. Each layer of neurons is fully connected to the next layer; there are no connections between neurons in the same layer, and no cross-layer connections. Neurons in the hidden and output layers have activation functions, whereas neurons in the input layer only receive the input data and have no activation functions. The learning process in neural networks involves adjusting the connection weights between neurons and the threshold of each functional neuron.
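This learning process can be illustrated with a minimal pure-Python sketch of a one-hidden-layer network trained by stochastic gradient descent on the squared error. It is a simplified illustration, not the paper's implementation: the hyperparameters are illustrative (the study uses q = 100 hidden neurons), hidden thresholds start at zero rather than at random values, and a linear output unit is used even though the text applies the ReLU activation at the output layer as well.

```python
import random

# Minimal one-hidden-layer MLP (1 input, q hidden ReLU units, 1 linear output)
# trained by stochastic gradient descent on E = (y_hat - y)^2 / 2.
def train_mlp(data, q=8, lr=0.01, epochs=3000, seed=0):
    rng = random.Random(seed)
    v = [rng.uniform(0, 1) for _ in range(q)]   # input-to-hidden weights
    gamma = [0.0] * q                           # hidden thresholds (start at 0 here)
    w = [rng.uniform(0, 1) for _ in range(q)]   # hidden-to-output weights
    theta = rng.uniform(0, 1)                   # output threshold
    for _ in range(epochs):
        for x, y in data:
            a = [v[h] * x - gamma[h] for h in range(q)]
            b = [max(0.0, ah) for ah in a]      # ReLU activations
            y_hat = sum(w[h] * b[h] for h in range(q)) - theta
            err = y_hat - y                     # dE/dy_hat
            for h in range(q):
                grad_b = err * w[h]             # dE/db_h (uses pre-update w)
                w[h] -= lr * err * b[h]
                if a[h] > 0:                    # ReLU passes gradient only when active
                    v[h] -= lr * grad_b * x
                    gamma[h] += lr * grad_b
            theta += lr * err
    def predict(x):
        b = [max(0.0, v[h] * x - gamma[h]) for h in range(q)]
        return sum(w[h] * b[h] for h in range(q)) - theta
    return predict

# Smoke test on a simple linear relation y = 2x (e.g., rescaled flows).
data = [(x / 10, 2 * x / 10) for x in range(1, 10)]
predict = train_mlp(data)
```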
We considered a three-layer MLP network consisting of d input neurons, a hidden layer with q hidden neurons, and an output layer with l output neurons. The threshold of the jth neuron in the output layer is denoted θ_j, and the threshold of the hth neuron in the hidden layer is denoted γ_h. Connection weight v_ih is the weight between the ith neuron in the input layer and the hth neuron in the hidden layer, whereas connection weight w_hj is the weight between the hth neuron in the hidden layer and the jth neuron in the output layer. Each hidden neuron h first computes the net input a_h = Σ_{i=1}^{d} v_ih x_i and generates an output b_h = f(a_h − γ_h). Each output neuron j uses the outputs of the hidden layer as inputs, computing the net input β_j = Σ_{h=1}^{q} w_hj b_h and the output ŷ_j = f(β_j − θ_j), where f is the activation function; the rectified linear unit function f(x) = max(0,x) is used here. The mean square error of the network on a sample (x_k, y_k) is E_k = (1/2) Σ_{j=1}^{l} (ŷ_j^k − y_j^k)^2. (1) The update of any parameter v is defined as v ← v + Δv. The training process of the MLP with backpropagation is as follows.
Step 1: Input the training dataset D = {(x_1,y_1),(x_2,y_2),...,(x_n,y_n)}, x_n ∈ R^d, y_n ∈ R^l, and determine the activation function. In this paper, the number of hidden neurons q was set to 100, the tolerance for the stopping criterion was set to 0.0001 (i.e., the value of Eq (1) is smaller than 0.0001), and the maximum number of iterations was 200.
Step 2: All connection weights and thresholds in the neural network are initialized randomly in the range of (0, 1).
Step 3: For (x_k, y_k), according to the current parameters and the function ŷ_j^k = f(β_j − θ_j), calculate the value of ŷ^k. The mean square error of the network is computed by Eq (1).
Step 4: Update the connection weights w_hj and v_ih and the thresholds θ_j and γ_h.
The error backpropagation (BP) algorithm based on the gradient descent strategy adjusts the parameters [46,47].
Step 5: Repeat Steps 1-4 until the value of Eq (1) satisfies the predefined tolerance for the stopping criterion.

Support vector regression model. A kernel function is used to map the data into a high-dimensional feature space, such that the nonlinear fitting problem in the input space is transformed into a linear fitting problem in the high-dimensional feature space. Common kernel functions include the linear kernel, polynomial kernel, Gaussian kernel, Laplace kernel, and sigmoid kernel; for a nonlinear mapping function φ, the kernel is k(x_i, x_j) = φ(x_i)^T φ(x_j). The goal of the support vector regression model is to find the partition hyperplane with the maximum margin. The partition hyperplane is represented by f(x) = w^T φ(x) + b. Suppose ε is the error bound between the observed value y and the predicted value f(x). With f(x) as the center, an epsilon-tube with a width of 2ε is established, and the problem is then formalized [46] as minimizing (1/2)||w||^2 + C Σ_{i=1}^{n} (ξ_i + ξ_i*) subject to f(x_i) − y_i ≤ ε + ξ_i, y_i − f(x_i) ≤ ε + ξ_i*, and ξ_i ≥ 0, ξ_i* ≥ 0, where C is the penalty coefficient, and ξ_i and ξ_i* are slack variables. In summary, the support vector regression model can be described as follows.
Step 1: Input the training dataset D = {(x_1,y_1),(x_2,y_2),...,(x_n,y_n)}, x_n ∈ R^d, y_n ∈ R^l, and select a kernel function k(x_i,x_j). In this paper, the linear kernel was chosen as the kernel function, the parameter C was set to 1, ε was set to 0.1, and the tolerance for the stopping criterion was set to 0.001 (i.e., the value of Eq (6) is smaller than 0.001).

Gradient boosted regression trees model (GBRT)
The gradient boosted regression trees model (GBRT) is described as follows.
Step 1: Input the training dataset D = {(x_1,y_1),(x_2,y_2),...,(x_n,y_n)}, x_n ∈ R^d, y_n ∈ R^l, and initialize the function f_0(x) = argmin_c Σ_{i=1}^{n} L(y_i, c). The loss function is L(y, f(x)) = (1/2)(y − f(x))^2, and the constant value c minimizes Σ_{i=1}^{n} L(y_i, c); namely, c is as close as possible to the y_i. Here, f_0(x) is a tree with only one node.
Step 2: The training dataset is used as input to iteratively build M trees; M was set to 100 in this paper.
(a) For the mth tree, m = 1,...,M, calculate the negative gradient of the loss function in the current model, r_mi = −[∂L(y_i, f(x_i)) / ∂f(x_i)] evaluated at f = f_{m−1}, which for the squared loss gives r_mi = y_i − f_{m−1}(x_i). Then, use r_mi as an estimate of the residuals, where i = 1,...,n, ∂ stands for the partial derivative, and n is the sample size.
(b) Fit a regression tree to the r_mi to obtain the leaf node regions R_mj of tree m, where j = 1,2,...,J, and J is the number of leaf nodes, which is not limited in the present study.
(c) For each leaf node region R_mj, where j = 1,2,...,J, calculate the best fitting value c_mj that minimizes the loss function: c_mj = argmin_c Σ_{x_i ∈ R_mj} L(y_i, f_{m−1}(x_i) + c).
Then, update f_m(x) = f_{m−1}(x) + Σ_{j=1}^{J} c_mj I(x ∈ R_mj), where I(·) is the indicator function.

Step 3: The final prediction model is f_M(x) = f_0(x) + Σ_{m=1}^{M} Σ_{j=1}^{J} c_mj I(x ∈ R_mj).

Detecting anomalous passenger flow condition

The DBSCAN algorithm was used to identify anomalous passenger flows. We normalized the in- or out-passenger-flow of a subway station s during each time window t of a day, N(s,t), with the minimum and maximum values of N(s,t) observed in the same time window during the whole data collection period, and took the normalized values as the original dataset S. In the DBSCAN algorithm, the maximum neighborhood radius ε defines the eps-neighborhood of a data point i ∈ S, denoted by N_ε(i) = {j ∈ S | dist(i,j) ≤ ε}, and MinPts determines the minimum number of data points within the eps-neighborhood. The Euclidean distance dist(i,j) = |N(s,t)_j − N(s,t)_i| was used to locate the ε-neighborhood of each data point i, and the typical parameter setting MinPts = 4 was used. The maximum neighborhood radius ε was set using the fourth-distance (4-dist) probability [50]: the distance between a data point and its fourth nearest neighbor is denoted as the 4-dist. The probability distribution of 4-dist was fitted by an exponential function, and the 4-dist value at which the slope of the fitting curve equaled −1 was used as the parameter setting of ε. Passenger flows were then classified using the DBSCAN algorithm: passenger flows larger than the maximum flow f_ε of the largest cluster were classified into the anomalous passenger flow (traffic) condition, and passenger flows smaller than or equal to f_ε were classified into the ordinary traffic condition. We use the out-passenger-flows N_out(s,t) at the subway station "Window of World" during the time window 7:00 p.m.-7:15 p.m. as an example (Fig 3B). Here, s denotes the subway station "Window of World", and the target time window of the prediction is t = 76. The label of each cluster generated by the DBSCAN algorithm is denoted by label(r), where 1 ≤ r ≤ n_c, and n_c is the total number of clusters.
During time window t of the ith day, the cluster label of the out-passenger-flow at the studied station s is denoted as label(N_out(s,t)_i). When label(N_out(s,t)_i) is the same as the label of the largest cluster, label(r)_max, the threshold passenger flow f_ε is determined as f_ε = max{ N_out(s,t)_i : label(N_out(s,t)_i) = label(r)_max }.
In Fig 3A, the out-passenger-flows N_out(s,t) at the "Window of World" station of the Shenzhen subway system are illustrated for every 15-min time window. Using the DBSCAN algorithm, the threshold passenger flow f_ε for each 15-min time window was determined. Anomalous growth of passenger flows was observed on December 31, caused by the firework show at the plaza of the recreational park at "Window of World" [51]. Across all subway stations, anomalous in-passenger-flows were found in 12.2% of time windows, whereas anomalous out-passenger-flows were found in 10.3% of time windows.
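The classification procedure above can be sketched end-to-end in pure Python. The flow values and eps below are illustrative (the study normalizes real flows and sets eps from the 4-dist distribution); only MinPts = 4 follows the text.

```python
# Self-contained sketch of the DBSCAN-based classification for one time
# window: normalized flows for that window across days are clustered, and
# flows above the maximum of the largest cluster are flagged as anomalous.
def dbscan_1d(points, eps, min_pts):
    """Label 1-D points: cluster id >= 0, or -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neigh = [j for j in range(len(points)) if abs(points[j] - points[i]) <= eps]
        if len(neigh) < min_pts:
            labels[i] = -1              # noise (may become a border point later)
            continue
        cluster += 1
        labels[i] = cluster
        seeds = [j for j in neigh if j != i]
        while seeds:                    # expand the cluster from core points
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster     # noise re-labelled as border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neigh = [k for k in range(len(points)) if abs(points[k] - points[j]) <= eps]
            if len(j_neigh) >= min_pts:
                seeds.extend(k for k in j_neigh if labels[k] is None)
    return labels

# Normalized out-passenger-flows for one window over several days; the last
# value mimics an event-driven burst.
flows = [0.20, 0.22, 0.21, 0.25, 0.23, 0.24, 0.26, 0.95]
labels = dbscan_1d(flows, eps=0.05, min_pts=4)
largest = max(set(l for l in labels if l >= 0), key=lambda c: labels.count(c))
f_eps = max(f for f, l in zip(flows, labels) if l == largest)
anomalous = [f for f in flows if f > f_eps]
print(anomalous)  # [0.95]
```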

Predicting dynamical passenger flows
Previous passenger flow prediction models have seldom been analyzed under anomalous traffic conditions, such as abrupt bursts of passenger flows at a particular subway station due to mass commercial or recreational events. Under anomalous traffic conditions, passenger demand may exceed the maximum capacity that a subway station can provide; emergency management measures are therefore required to protect the safety and order of subway transportation. In addition, under large crowd gatherings, subway service restrictions can be an important way to prevent passengers from flowing into the crowded area, hence avoiding dangerous crowding situations [52]. Therefore, predicting passenger flows under anomalous traffic conditions is even more important than predicting flows under ordinary conditions. Three typical indexes, the mean absolute percentage error (MAPE), the variance of absolute percentage error (VAPE), and the root mean square error (RMSE), were used to evaluate the accuracy of prediction: MAPE = (1/n) Σ_{i=1}^{n} |(y_i − ŷ_i)/y_i| × 100%, VAPE = Var(|(y_i − ŷ_i)/y_i|), and RMSE = sqrt((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2), where y = {y_1,y_2,...,y_i,...,y_n} is the sequence of observed values, ŷ = {ŷ_1,ŷ_2,...,ŷ_i,...,ŷ_n} is the sequence of predicted values, and n is the number of observed values.
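Under these definitions, the three indexes can be written directly; in this sketch VAPE is taken as the population variance of the absolute percentage errors, and MAPE is returned as a fraction rather than a percentage.

```python
import math

# The three evaluation indexes: y is the observed sequence, y_hat the
# predicted one. Percentage errors are returned as fractions.
def mape(y, y_hat):
    return sum(abs((a - p) / a) for a, p in zip(y, y_hat)) / len(y)

def vape(y, y_hat):
    e = [abs((a - p) / a) for a, p in zip(y, y_hat)]
    m = sum(e) / len(e)
    return sum((x - m) ** 2 for x in e) / len(e)   # population variance

def rmse(y, y_hat):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y))

y, y_hat = [100, 200, 400], [110, 180, 400]
print(round(mape(y, y_hat), 4))  # (0.1 + 0.1 + 0.0) / 3 ≈ 0.0667
print(round(rmse(y, y_hat), 2))  # sqrt((100 + 400 + 0) / 3) ≈ 12.91
```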

High-performance regions of different predictive models
We analyzed the performance of the four predictive models under different numbers of time windows n_step that a prediction is made before the target time window t_target. When a larger n_step was set, the passenger flow prediction results could be obtained earlier; meanwhile, the accuracy of prediction decreased because more recent data were not used. Here, we explored the high-performance regions of the different predictive models under ordinary and anomalous traffic conditions. Fig 5 shows the predicted passenger flows at the subway station "Window of World" on December 30, 2014. We found that under the ordinary condition, all predictive models except the MLP model performed well even when the passenger flow prediction was conducted 2 h before the target time window. The prediction accuracy of the MLP model began to decrease when n_step was larger than two time windows, indicating that under the ordinary traffic condition the MLP model only worked well for short-term (less than 30 min) prediction. Fig 6 shows the predicted passenger flows of the subway station "Window of World" on December 31, 2014. In contrast to the results under the ordinary traffic condition, we found that under the anomalous traffic condition, the MLP model performed best. Given that the HA model is insensitive to the prediction time, the same predictive results were obtained for different n_step, and the HA model could not capture the anomalous traffic condition at all. For all predictive models, the prediction accuracy was not acceptable when the target time window was four time windows (1 h) later than the prediction time; the predicted results of all models tended to approach the historical average values when n_step ≥ 4. Table 2 shows the RMSE, MAPE, and VAPE values of the predictive results of passenger flows at the "Window of World" station based on the SVR, MLP, GBRT, and HA models.
The predictions were made 1 to 8 time windows ahead of the target time window.
We summarized the performance of the four predictive models in Fig 7. Under the ordinary traffic condition, the prediction errors of the SVR, MLP, and GBRT models all increased with the number of time windows n_step between the prediction time and the target time window. In particular, the RMSE and MAPE values of the prediction results of the MLP model increased much faster than those of the SVR and GBRT models. When the prediction was made n_step > 5 time windows earlier than the target time window, the MLP model performed even worse than the HA model. The minimum MAPE = 16.9% was generated by the SVR model when n_step = 2, implying that the most recent data may not be the best data input. Furthermore, the GBRT model had a larger VAPE value than the MLP and SVR models. Taking all the results together, the SVR model performed best in the ordinary traffic condition.
When a larger n_step was set (i.e., the prediction time t is further ahead of the target time window t_target), the prediction errors of the SVR, MLP, and GBRT models increased faster under the anomalous traffic condition than under the ordinary traffic condition. The RMSE and MAPE values of the prediction results of the MLP, SVR, and GBRT models were similar, but for most n_step settings the MLP model was slightly better. The minimum MAPE = 18.0% was generated by the MLP model when n_step = 2, again implying that the most recent data may not be the best data input. The GBRT model had a larger VAPE value when n_step was small, which increased slowly with increasing n_step; meanwhile, the VAPE value of the MLP model was small when n_step ≤ 2, but grew faster afterward.
Ultimately, the MLP model performed best in the anomalous traffic condition. Table 3 and Table 4 show the RMSE, MAPE, and VAPE values of the four models under different n_step settings for in-passenger-flow and out-passenger-flow predictions. Given that all validations were made for all 118 subway stations of the Shenzhen subway network, the average MAPE and RMSE values were inflated by the majority of low-passenger-flow stations. If we concentrated on the subway stations with the top 25% average passenger flows, the minimum MAPE = 11.1% was generated by the SVR model (and the GBRT model) when n_step = 2 for the ordinary traffic condition; meanwhile, the minimum MAPE = 12.3% was generated by the MLP model when n_step = 2 for the anomalous traffic condition. This result indicates that the inherent pattern of passenger flows at a subway station prominently determines the prediction accuracy. In general, passenger flows of large-flow stations are more predictable than those of low-flow stations. In addition, for a specific group of subway stations, the best model may differ from the model obtained for all subway stations; in practice, more detailed model selection strategies can be applied to different subgroups of subway stations. Table 5 and Table 6 describe details of the RMSE, MAPE, and VAPE values of the four models when n_step = 2.

Conclusions
An effective and reliable passenger flow prediction model can benefit the management of transportation systems, such as operation planning, revenue planning, and facility improvement. In this paper, we generated four models to predict passenger flows at each station of the Shenzhen subway system. We investigated how long in advance passenger flows could be accurately predicted. Under the ordinary traffic condition, acceptable results can be obtained even 2 h in advance, while under the anomalous traffic condition, the prediction accuracy of all predictive models was not acceptable when the prediction was made 1 h in advance. Li et al. [53] compared detrending models and multi-regime models in an effort to find appropriate traffic prediction models in practice. Our findings highlight the importance of selecting proper models: the SVR model and the MLP model performed best in the ordinary and anomalous traffic conditions, respectively. Our findings also highlight that, compared with the selection of models, the inherent patterns of passenger flows influence prediction accuracy even more prominently. According to the analysis and results of the present study, the SVR prediction model is suggested when passenger flows are relatively stable, whereas the MLP prediction model can achieve more reliable results when passenger flows show anomalous patterns. In addition, how long the prediction is made in advance of the target time window should also be considered. As n_step increases, the prediction error increases, and the prediction errors of the three knowledge discovery models gradually approach that of the simple HA model. Hence, when n_step is large, the simple HA model can be a good option given its low computational cost.
We think our results can offer useful information for the management of public transportation, including adjusting operating frequency and alleviating passenger congestion. Finally, we would like to discuss the limitations of this study and future work. First, the four predictive models were used in their most basic forms, and we did not cover all existing models for traffic prediction; variants of these fundamental models could further improve the accuracy of predictions. Next, better classifications of traffic conditions or subway stations could further improve the prediction accuracy, and are worthy of future work. Finally, transportation information on social media websites usually precedes the emergence of actual mobility, and therefore incorporating this kind of information with traditional urban transportation data is an interesting future research direction [54,55].
Supporting information

S1 File. The minimal dataset to replicate this study. (CSV)