A Hybrid Spatiotemporal Deep Learning Model for Short-Term Metro Passenger Flow Prediction

Jiangsu Key Laboratory of Urban ITS, Southeast University, Nanjing, China
Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies, Nanjing, China
School of Transportation, Southeast University, 2 Dongnandaxue Rd., Nanjing, Jiangsu 211189, China
School of Traffic Engineering, Huaiyin Institute of Technology, Meicheng East Road #1, Huaian, Jiangsu 223001, China
Civil Aviation College, Nanjing University of Aeronautics and Astronautics, Jiangjun Road #29, Nanjing, Jiangsu 211106, China
Business School, Huaian Vocational College of Information Technology, Meicheng East Road #3, Huaian, Jiangsu 223003, China


Introduction
The metro system constitutes an important supplement to urban transportation systems, providing travellers with sustainable, reliable, and efficient mobility, reducing the number of trips made by private vehicles, and thereby reducing traffic congestion and vehicle emissions in urban areas. However, due to fluctuating spatial and temporal travel demand, metro systems have suffered from a series of problems in recent years, such as overcrowded platforms and poor service levels [1]. Accordingly, accurate passenger flow prediction has become an important component of metro transportation systems. The prediction results can be applied to support metro system management, such as operation planning and station passenger crowd regulation [2][3][4][5].
Over the past few decades, considerable efforts have been devoted to predicting short-term metro passenger flow [2][3][4][6][7][8][9]. Wang et al. proposed a support vector machine combined online model for short-term metro ridership forecasting. Using data collected from the Nanjing metro system, the results indicated that the proposed model could better capture the periodicity and nonlinearity of metro ridership [6]. Zhang and Liang used an improved Kalman filter model to forecast short-term passenger flow in the Beijing metro system [9]. Hao et al. proposed a sequence-to-sequence model embedded with the attention mechanism to make multistep and network-wide predictions of short-term passenger flow. The developed model showed a strong ability to capture long-range dependencies and achieved more scalable and robust prediction performance than other baselines [4].
Most previous studies have treated short-term metro passenger flow prediction only as a typical time-series problem and failed to incorporate the hidden spatial correlations between stations to enhance the prediction accuracy [2,7,8]. For example, station #A and station #B may exhibit similar passenger flow patterns because they are both close to a college campus (see Figure 1). In addition, station #C and station #D are less than 1 km apart, are located in the same business district, and may both have high passenger flow during the same peak period. Given this spatiotemporal nature, it is essential to integrate both spatial and temporal characteristics into short-term metro prediction models, which has great potential for improving the prediction performance in practical applications [10].
During the past few years, some researchers have proposed various hybrid models to account for both spatial and temporal dependencies in metro passenger flow prediction [2,7,8,11]. For example, Zhu et al. developed an ARIMA-Wavelet model for forecasting the daily passenger flow of Beijing urban rail transit [7]. Li et al. combined a linear ARIMA model with nonlinear symbolic regression to model the complex relationships underlying the passenger flow dataset. The results, based on a real dataset from Xi'an metro line 1, suggest that the developed hybrid model outperforms the single models [8]. Wei and Chen proposed a three-stage hybrid EMD-BPN approach for short-term passenger flow prediction in the Taipei metro system [2].
More recently, deep learning models have been widely used in transportation research because they can extract hidden nonlinear relationships through distributed and hierarchical feature representations [3][12][13][14][15]. Liu et al. proposed an end-to-end deep learning framework based on the long short-term memory (LSTM) neural network for short-term metro passenger flow prediction [3]. The results suggest that multiple LSTM layers could better capture the temporal dependencies in the passenger flow data and exhibit strong prediction performance. However, the LSTM neural network can only capture temporal dependencies and fails to extract the spatial correlations among different stations. To address this limitation, this study proposes a hybrid spatiotemporal deep learning model based on the convolutional LSTM neural network for short-term metro passenger flow prediction. In recent years, the convolutional LSTM neural network has been used to solve various spatiotemporal transportation prediction problems, such as taxi demand forecasting [14], crash risk prediction [15], and bus travel time prediction [16]. To the best of the authors' knowledge, this paper is one of the first attempts to employ the convolutional LSTM neural network for short-term metro passenger flow prediction. The main contributions of this paper can be summarized as follows.
(a) The proposed hybrid spatiotemporal deep learning neural network (HSTDL-net) can learn both spatial and temporal dependencies among the passenger flows of metro stations.
(b) The proposed hybrid spatiotemporal deep learning architecture integrates both convolutional filters and a recurrent component in one end-to-end structure, which could learn the spatiotemporal patterns of passenger flows more efficiently.
(c) Validated on a real dataset provided by the Nanjing metro system, the proposed hybrid spatiotemporal deep learning model outperforms the selected benchmark methods, including conventional time-series models and several state-of-the-art machine learning approaches.
The rest of the paper is organized as follows. Section 2 discusses the methodology of the convolutional LSTM neural network and the structure of the proposed HSTDL-net. Section 3 describes the data source. Section 4 presents the results of the data analysis and compares the predictive performance of the proposed approach with that of the benchmark models. Finally, conclusions are drawn and future research directions are indicated in Section 5.

Methodology
In this section, we construct a hybrid spatiotemporal deep learning neural network for predicting short-term metro passenger flow. The proposed HSTDL-net integrates both an LSTM neural network and a convolutional LSTM neural network into an end-to-end deep learning architecture. The methods used are briefly discussed in this section.

2.1. The Structure of HSTDL-Net. In this study, two different types of variables are incorporated in the short-term metro passenger flow prediction. The first type of variables varies both spatially and temporally during the study period, such as the inbound and outbound passenger flow variables. Both spatial and temporal dependencies exist in this type of variables. The second type of variables varies only temporally and is spatially static during the study period, such as the weather variables and air quality variables. This type of variables exhibits strong periodicity and only temporal dependencies.
In this study, to capture the spatial and temporal dynamics of the two types of variables, we propose a hybrid spatiotemporal deep learning neural network (HSTDL-net) for predicting short-term metro passenger flow. Figure 2 illustrates the structure of the proposed HSTDL-net. More specifically, stacked convolutional LSTM layers are used to capture the spatiotemporal features in the first type of variables, and stacked LSTM layers are used to extract the temporal features in the second type of variables. The high-level features extracted by the two components are then merged and input into multiple fully connected layers to generate the final predicted passenger flow value. The methods used in each component are briefly explained as follows.

Long Short-Term Memory (LSTM) Neural Network.
The long short-term memory (LSTM) neural network is a specific type of recurrent neural network (RNN) that has exhibited strong performance in forecasting time-series datasets [13,17,18]. For example, Ma et al. used the LSTM neural network to capture the temporal patterns in short-term traffic speed data collected by remote microwave sensors [13]. Wollmer et al. applied the LSTM neural network to online driver distraction detection with driving and head tracking data [18]. The LSTM neural network can address the issues of gradient exploding and gradient vanishing, which are very prevalent in traditional RNNs with large prediction time steps [17]. The most important components of the LSTM neural network are the three kinds of gates in the memory cell of the hidden layer (see Figure 3). Specifically, the forget gate is designed to eliminate information that is not related to the prediction task. The input gate controls the information that can be considered in the prediction task. For time step $t$, the input gate $i_t$, forget gate $f_t$, and output gate $o_t$ are calculated iteratively as in (1)-(5), where $x_t$ indicates the second type of variables mentioned in Section 2.1, which are only temporally varied but spatially static. Specifically, in this study $x_t$ represents the input historical weather variables in each time step, such as temperature, precipitation, wind speed, and pressure; $c_t$ indicates the activation vector of each cell; and $h_t$ indicates the related predicted value. $W_{xi}$ indicates the weight matrix between the input weather variables and the output of the input gate, and $b_i$ indicates the bias value of the input gate. Similarly, $W_{hi}$, $W_{ci}$, $W_{xf}$, $W_{hf}$, $W_{cf}$, $W_{xc}$, $W_{hc}$, $W_{xo}$, $W_{ho}$, and $W_{co}$ indicate the weight matrices connecting the vector of the first subscript to the vector of the second subscript, and $b_c$, $b_f$, and $b_o$ indicate the relevant bias values. $\odot$ represents the element-wise product. $\sigma$ and $\tanh$ represent the activation functions in the LSTM neural network, with the following forms: $\sigma(x) = 1/(1 + e^{-x})$ and $\tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$ (6).
In many previous studies, multiple LSTM layers were stacked so that more complex temporal dependencies among variables could be learned and the prediction accuracy could be further improved [14,15].
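For reference, the gate computations described above match the standard LSTM with peephole connections, in which $W_{ci}$, $W_{cf}$, and $W_{co}$ act on the cell state through the element-wise product. A sketch of that standard form, written with the symbols defined above (the assignment of equation numbers to individual lines is an assumption), is:

```latex
\begin{align}
i_t &= \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} \odot c_{t-1} + b_i\right) \tag{1}\\
f_t &= \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} \odot c_{t-1} + b_f\right) \tag{2}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right) \tag{3}\\
o_t &= \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} \odot c_t + b_o\right) \tag{4}\\
h_t &= o_t \odot \tanh\left(c_t\right) \tag{5}
\end{align}
```

Every weight and bias listed in the text appears exactly once in this form, which is why it is a plausible reading of (1)-(5).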

Convolutional LSTM Neural Network.
In this study, the metro passenger flow not only exhibits significant temporal patterns at each station but also shows strong spatial patterns across different stations. For example, two stations with similar surrounding land use may exhibit similar passenger flow patterns. Due to this spatiotemporal nature, the standard LSTM neural network is not an ideal model for short-term metro passenger flow prediction because it cannot learn the spatial dependencies among variables [14,16]. To overcome this limitation, Shi et al. combined convolutional layers and LSTM layers into an end-to-end deep learning structure, the convolutional LSTM, which can model both spatial and temporal characteristics simultaneously [19]. The most important feature of the convolutional LSTM neural network is the convolution operation between neighboring LSTM cells. Specifically, all the inputs, hidden states, and outputs of the various gates are transformed into 3D tensors (see Figure 4(a)). Then, the convolutional filters are applied to the input-to-state and state-to-state transitions for a given passenger flow grid cell (see Figure 4(b)).
Journal of Advanced Transportation
[Figure labels: first type of variables (e.g., inbound and outbound passenger flow); second type of variables (e.g., historical weather-related information).]
For each time step $t$, the input gate $I_t$, forget gate $F_t$, and output gate $O_t$ work in a similar way to those of the standard LSTM neural network ((7)-(11)), where the operator $*$ refers to the convolution operator, which is the greatest difference between the convolutional LSTM and the standard LSTM neural network. $X_t$ refers to the tensor of input metro passenger flow for time step $t$. $X_t$ is a 3D tensor and can be regarded as a temporal stack of the passenger flow map for all the stations. $I_t$, $H_t$, $C_t$, and $F_t$ refer to the input gate tensor, hidden tensor, cell output tensor, and forget gate tensor, respectively. Here, $W_{xi}$, $W_{hi}$, $W_{xf}$, $W_{hf}$, $W_{xc}$, $W_{hc}$, $W_{xo}$, and $W_{ho}$ serve as convolutional filters, which are replicated across the tensors with shared weights and thus exploit local spatial correlations. Zero padding is applied to keep the spatial dimensions consistent during the convolution operation. The recurrent structure and the convolution component of the convolutional LSTM neural network allow both the spatial and temporal patterns of short-term metro passenger flow to be learned better. Similar to the standard LSTM neural network, multiple convolutional LSTM layers can be stacked to build a more robust and accurate predictor. In this study, the input metro passenger flow of all the stations for each time step is first reshaped into a tensor structure $X = (X_1, X_2, \ldots, X_t)$ (see Figure 4). Then, through a two-layer convolutional LSTM neural network, the input metro passenger flow tensors are mapped to a sequence of hidden tensors $H = (H_1, H_2, \ldots, H_t)$.
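The gate updates described above can be sketched in the same notation as the standard LSTM, with the matrix products replaced by convolutions. The form below is an assumption: it uses only the eight filters listed in the text (no peephole terms) and assumes bias terms $b_i$, $b_f$, $b_c$, and $b_o$ by analogy with the standard LSTM; the equation numbers follow the range cited in the text.

```latex
\begin{align}
I_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right) \tag{7}\\
F_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right) \tag{8}\\
C_t &= F_t \odot C_{t-1} + I_t \odot \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \tag{9}\\
O_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right) \tag{10}\\
H_t &= O_t \odot \tanh\left(C_t\right) \tag{11}
\end{align}
```

The formulation of Shi et al. [19] additionally includes peephole terms of the form $W_{ci} \odot C_{t-1}$ inside the gates; they are left out of this sketch only because the text does not list the corresponding weights.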

Feature Merge Layer and Fully Connected Layer.
The temporal features captured by the LSTM neural network and the spatiotemporal features captured by the convolutional LSTM neural network are then concatenated into a dense vector in the feature merge layer. Finally, the dense vector is input into several fully connected layers to obtain the final predicted value of metro passenger flow.
Here, $X^{LSTM}_t$ and $X^{ConvLSTM}_t$ indicate the features extracted by the LSTM and convolutional LSTM layers at time step $t$, respectively, and $W_{LSTM}$, $W_{ConvLSTM}$, and $b_t$ indicate the related weights and bias.
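A minimal sketch of the merge step may help make the roles of the two weight blocks concrete. It assumes a single fully connected layer without an activation function, and all feature values, weights, and helper names (`dense`, `x_lstm`, `x_convlstm`) are hypothetical. It shows that applying one dense layer to the concatenated vector is the same as applying separate weight blocks to the LSTM and convolutional LSTM features and adding the results.

```python
def dense(weights, bias, features):
    """Affine map: one output per weight row (no activation, for clarity)."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]

# Hypothetical extracted features at one time step.
x_lstm = [0.2, -0.1]            # LSTM branch (weather variables)
x_convlstm = [0.5, 0.3, -0.4]   # ConvLSTM branch (passenger flow), flattened

# Feature merge layer: concatenate the two feature vectors.
merged = x_lstm + x_convlstm

# One fully connected layer with illustrative weights (2 outputs, 5 inputs).
W = [[0.1, 0.2, 0.3, 0.4, 0.5],
     [-0.5, 0.4, -0.3, 0.2, -0.1]]
b = [0.05, -0.05]
y_merged = dense(W, b, merged)

# Equivalent view: split W column-wise into an LSTM block and a ConvLSTM block.
W_lstm = [row[:len(x_lstm)] for row in W]
W_convlstm = [row[len(x_lstm):] for row in W]
zero_bias = [0.0] * len(b)
y_split = [p + q for p, q in zip(dense(W_lstm, b, x_lstm),
                                 dense(W_convlstm, zero_bias, x_convlstm))]
```

In the full model several such layers would be stacked, each followed by a nonlinearity; that detail is not specified in the text and is omitted here.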

Objective Function.
Figure 3: Illustrations of long short-term memory neural network.
The objective of the short-term metro passenger flow prediction model is to minimize the error between the real and predicted passenger flows for each metro station at every time step. During the whole training process, the objective function is computed from $y(i, j)$ and $\hat{y}(i, j)$, which stand for the real and predicted passenger flow for metro station $i$ at time step $j$, respectively. $n$ refers to the number of metro stations, $m$ refers to the number of total predicted time steps, and $n_p = n \times m$.
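One form consistent with the symbols defined above is sketched below. It assumes that the objective is the root mean square error, which is the loss named in the model construction section; this choice is an assumption, and a mean squared error objective would use the same symbols without the square root.

```latex
\begin{equation}
\min \; L = \sqrt{\frac{1}{n_p} \sum_{i=1}^{n} \sum_{j=1}^{m} \left( y(i, j) - \hat{y}(i, j) \right)^2}, \qquad n_p = n \times m
\end{equation}
```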
2.6. Evaluation Metrics. Three statistics are used to evaluate the performance of the proposed metro passenger flow prediction model: root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE), as formalized in (13)-(15).
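The three metrics have standard definitions, which can be sketched in plain Python as follows. The function names and sample values are illustrative, and skipping zero observations in MAPE is an assumption, since the text does not say how intervals with no passengers are handled.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def mae(y_true, y_pred):
    """Mean absolute error: mean(|y - y_hat|)."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent: 100 * mean(|y - y_hat| / |y|).

    Assumption: observations equal to zero are skipped to avoid division by zero.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when all observed values are zero")
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in pairs) / len(pairs)

# Hypothetical observed and predicted 10-minute passenger counts.
observed = [100, 200, 400]
predicted = [110, 190, 380]
scores = (rmse(observed, predicted), mae(observed, predicted), mape(observed, predicted))
```

RMSE and MAE are expressed in passengers per interval, so they grow with station volume, whereas MAPE is relative; this is why high-volume stations can rank worst on RMSE and MAE without ranking worst on MAPE.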

Data Source
In this study, the metro passenger flow dataset is collected from Nanjing metro line 2, which is maintained by the Nanjing Metro Corporation (NMC). Nanjing is the economic and cultural center of Eastern China, with more than 8 million people [20]. The Nanjing metro network began operating in 2005 and has expanded rapidly over the past decade due to the growth of urban travel demand. Metro line 2 was completed and opened in 2010, connecting the city's east and west areas [21]. As shown in Figure 5, Nanjing metro line 2 has 26 stations with a total length of 38 km, including 2 terminal stations, 2 transfer stations, and 22 regular stations.
Nanjing metro line 2 operates from 6:00 am to 11:00 pm. In the Nanjing metro system, passengers need to tap in and tap out with smart cards when they enter and exit a station. Thus, each trip record provides the following information: smart card ID, card type, total amount, the timestamp of entry or exit, and the station name. In this study, the short-term metro passenger flow is defined as the number of passengers that enter a station in every 10-minute interval. The study period covers a total of 66 weekdays from July 1 to September 30, 2014. For model training, the first 50 days are used as the training dataset and the remaining 16 days are used for testing.
The weather variables are collected from the Nanjing Meteorological Bureau, which provides hourly aggregated weather information through local meteorological stations. The obtained weather variables include the hourly aggregated temperature, precipitation, wind speed, and pressure. The variables considered in the present study are described in Table 1.
Figure 6 further illustrates the temporal distribution of daily passenger flows for three typical station types. The passenger flow patterns of different types of metro stations vary greatly. For the regular stations, the passenger flows show obvious morning and evening peak periods. For the terminal stations, the passenger flows show only a morning peak. By contrast, the passenger flows of transfer stations remain relatively high in the daytime and may still grow rapidly late at night. In general, the passenger flow patterns vary greatly among the types of metro stations, which poses great challenges for short-term metro passenger flow prediction.
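The 10-minute passenger flow defined above can be sketched as a simple count of tap-in records per station and interval. The record layout (station name, entry timestamp string), the function name `ten_minute_inflow`, and the sample records are hypothetical; the actual smart card data format is not given in the text.

```python
from collections import Counter
from datetime import datetime

# Service hours 6:00 am to 11:00 pm give 17 hours of six 10-minute intervals each.
INTERVALS_PER_DAY = (23 - 6) * 6

def ten_minute_inflow(records):
    """Count tap-in records per (station, start of 10-minute interval)."""
    counts = Counter()
    for station, timestamp in records:
        ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        # Round the timestamp down to the start of its 10-minute interval.
        bucket = ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)
        counts[(station, bucket)] += 1
    return counts

# Hypothetical tap-in records: (station name, entry timestamp).
records = [
    ("Xinjiekou", "2014-07-01 08:03:15"),
    ("Xinjiekou", "2014-07-01 08:09:59"),
    ("Xinjiekou", "2014-07-01 08:10:00"),
    ("Shanghailu", "2014-07-01 08:04:30"),
]
flow = ten_minute_inflow(records)
```

Under this definition each station contributes 102 observations per day, so the 50 training days and 16 test days correspond to 5,100 and 1,632 intervals per station, respectively.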

Model Construction.
A series of parameters should be determined in the process of model construction. In this study, uniform random search is applied to select the optimal values. More specifically, the input passenger flow of the 26 metro stations was arranged as a grid map with a length of 2 and a width of 13. For the convolution filter, the filter size was set to (2 × 2), which is a common setting in many previous studies [9,22,23]. The filter length was searched over the range from 10 to 40 (see Table 2). For the recurrent component, the number of time steps and the number of hidden units in the LSTM cell were chosen according to the results of the uniform random search. For the optimizer, four widely used optimizers were compared: Adam, Nadam, RMSprop, and SGD [24]. RMSprop was selected as the best one, and the learning rate was searched over the range from 0.001 to 0.01. For the training-related parameters, the batch size was searched from 20 to 80, and the number of training epochs was searched from 50 to 200. In addition, to address the overfitting issue, a dropout layer was applied in this study [25]. The uniform random search intervals and the optimal parameter values are listed in Table 2.
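The uniform random search described above can be sketched as follows, using only the search ranges stated in the text. The parameter names, the number of trials, and the `toy_validation_error` function are hypothetical; in practice the evaluation function would train the model with the sampled configuration and return its validation error.

```python
import random

# Search ranges taken from the text; the dictionary keys are illustrative names.
SEARCH_SPACE = {
    "filters": ("int", 10, 40),
    "learning_rate": ("float", 0.001, 0.01),
    "batch_size": ("int", 20, 80),
    "epochs": ("int", 50, 200),
}

def sample_config(rng):
    """Draw one configuration uniformly at random from the search space."""
    config = {}
    for name, (kind, low, high) in SEARCH_SPACE.items():
        if kind == "int":
            config[name] = rng.randint(low, high)   # inclusive on both ends
        else:
            config[name] = rng.uniform(low, high)
    return config

def random_search(evaluate, n_trials, seed=0):
    """Return the sampled configuration with the lowest evaluation score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_validation_error(config):
    """Placeholder objective standing in for a real train-and-validate run."""
    return abs(config["filters"] - 32) + 1000 * abs(config["learning_rate"] - 0.005)

best, score = random_search(toy_validation_error, n_trials=50)
```

Random search samples each parameter independently, so adding further parameters such as the number of time steps or hidden units only requires another entry in the search space.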
The loss function of the training model is the root mean square error (RMSE). All the input variables were standardized by min-max normalization before model construction.
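Min-max normalization maps each variable to the range [0, 1] using its minimum and maximum. A minimal sketch is given below; the function names and sample values are hypothetical, and fitting the bounds on the training data only is an assumption, since the text does not state which data the bounds are computed from.

```python
def fit_min_max(values):
    """Return the (min, max) bounds of a variable."""
    return min(values), max(values)

def min_max_scale(values, low, high):
    """Scale values with x' = (x - min) / (max - min)."""
    span = high - low
    if span == 0:
        return [0.0 for _ in values]  # constant variable: no spread to scale
    return [(v - low) / span for v in values]

def inverse_min_max(scaled, low, high):
    """Map scaled values back to the original units."""
    return [s * (high - low) + low for s in scaled]

# Hypothetical 10-minute passenger counts.
train_flow = [120, 80, 200, 160]
test_flow = [100, 220]

low, high = fit_min_max(train_flow)          # bounds from training data only
train_scaled = min_max_scale(train_flow, low, high)
test_scaled = min_max_scale(test_flow, low, high)
restored = inverse_min_max(train_scaled, low, high)
```

With bounds fitted on the training data, test values outside the training range fall slightly outside [0, 1]. Predictions are mapped back to passenger counts with the inverse transform before error metrics are computed.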
The proposed convolutional LSTM model was implemented with the Keras framework using TensorFlow as the backend [26]. All the experiments were conducted with Python 3.5.2 on a Windows 10 system. To accelerate model training, a GTX 1060 graphics card was used. The optimal parameters are shown in Table 2.

Results of the Proposed HSTDL-Net.
To compare the model performance for different types of metro stations, four test datasets were predicted by the trained HSTDL-net. As shown in Tables 3 and 4, the prediction results indicate that the models of transfer stations usually generate the highest RMSE and MAE values for both inbound and outbound passenger flows. This is expected because transfer stations have large passenger volumes, which may result in larger prediction residuals. Moreover, the models of terminal stations show the worst performance in terms of MAPE, with the highest values for both inbound and outbound passenger flows. The passenger flow patterns of terminal stations are more unstable and complex, which brings great challenges to the prediction task. The residual errors of suburban stations are generally lower than those of urban stations. The suburban stations exhibit better prediction performance because they are mainly located around residential areas with commuting as the primary trip purpose and accordingly exhibit a more stable passenger flow pattern. In addition, stations near large shopping centers, such as Xinjiekou and Shanghailu stations, show relatively poor performance in terms of higher residuals. The reasons are mainly twofold. First, stations located in the central business areas may also be transfer stations involving multiple metro lines, resulting in greater fluctuations of passenger flow. Second, trips for shopping and entertainment are more flexible and irregular than commuting, making passenger flow prediction for these stations more difficult.

Results of Model Comparison.
The proposed HSTDL-net is further compared with other prevalent prediction models using the same dataset. Specifically, CNN, LSTM, ARIMA, multilayer perceptron (MLP), and the gradient boosting regression tree (GBRT) are selected as the benchmark models in this study. The selected models include a statistical method, a machine learning method, an ensemble tree-based method, and deep learning methods, which ensures a fair and comprehensive comparison. Moreover, these methods have been widely applied in many previous studies of short-term passenger flow prediction [3,7,8,11,27].
ARIMA is the most conventional regression method for modeling time-series datasets. The ARIMA method integrates the autoregressive component, the differencing (integration) step, and the moving average part [28]. MLP is a typical feedforward neural network architecture that consists of multiple fully connected layers. The hidden layers in MLP can capture the complex nonlinear relationships in a time-series dataset [23]. CNN is a popular method that has achieved great success in image recognition and signal processing [22,23]. More recently, many researchers have also applied it to various transportation problems [14,15,29]. The most important feature of CNN is that the neurons are connected to the preceding layer through a moving patch with shared weight values. Thus, CNN-based methods are good at extracting the spatial dependencies among predicted variables. LSTM is a particular type of recurrent neural network that can mitigate the gradient exploding and gradient vanishing problems [17,30]. LSTM has been adopted to forecast short-term passenger flow in previous studies [3,18]. Its mechanism has been discussed in the previous section. GBRT is built on the core idea of the ensemble tree, which aims to improve model performance and robustness by integrating the prediction results from multiple weak regression trees [27]. After achieving great success in other fields, GBRT has become increasingly prevalent in transportation studies [27,31].
To ensure a fair comparison, all the compared models are fine-tuned with the same training dataset and number of training epochs. However, the structure of the input data for HSTDL-net cannot be directly applied to the other compared models, so some data processing is needed to satisfy their input requirements. Specifically, for the ARIMA, MLP, and GBRT models, the input passenger flow of all stations in the past T time steps is reshaped as a matrix in the form of (batch size, T). For the CNN model, the input passenger flow data are reshaped as a 4D tensor in the form of (batch size, 2, 13, T), where T indicates the number of image channels and (2, 13) indicates the size of the image. For the LSTM model, the input passenger flow data are reshaped as a 3D tensor in the form of (batch size, T, 1).
Figure 8 visualizes the results of the model comparison in terms of MAPE. Overall, for both inbound and outbound passenger flow predictions, all the models achieve the best performance on transfer stations and the worst performance on terminal stations. In addition, the HSTDL-net has the lowest MAPE values among the compared models, indicating that the proposed model can capture both spatial and temporal dependencies among passenger flow variables.
Tables 5 and 6 further list the prediction results of all compared models. The ARIMA model achieves the lowest prediction accuracy, with the highest RMSE, MAE, and MAPE values for all three types of stations. Moreover, for regular stations the [...] respectively. The results of the comparative analysis indicate that the proposed HSTDL-net can more effectively discover both spatial and temporal hidden correlations between stations for short-term metro passenger flow prediction.
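The input reshaping described for the benchmark models can be sketched with nested lists, one sample at a time (the batch axis is left out). The row-major placement of stations on the 2 × 13 grid, the per-station windows for the flat and LSTM inputs, the function names, and the synthetic flow values are all assumptions made for illustration.

```python
ROWS, COLS = 2, 13  # grid map for the 26 stations

def to_grid(station_values):
    """Arrange 26 per-station values into a 2 x 13 grid (row-major order assumed)."""
    assert len(station_values) == ROWS * COLS
    return [station_values[r * COLS:(r + 1) * COLS] for r in range(ROWS)]

def cnn_sample(history):
    """history: T lists of 26 values -> nested list shaped (2, 13, T)."""
    grids = [to_grid(step) for step in history]
    return [[[g[r][c] for g in grids] for c in range(COLS)] for r in range(ROWS)]

def flat_sample(history, station):
    """One station's past T values, shape (T,), for ARIMA, MLP, or GBRT."""
    return [step[station] for step in history]

def lstm_sample(history, station):
    """The same values with a trailing feature axis, shape (T, 1)."""
    return [[step[station]] for step in history]

T = 4
# Synthetic flows: value = 100 * time step + station index.
history = [[100 * t + s for s in range(ROWS * COLS)] for t in range(T)]

cnn_input = cnn_sample(history)
flat_input = flat_sample(history, station=5)
lstm_input = lstm_sample(history, station=5)
```

Stacking such samples along a leading axis yields the (batch size, 2, 13, T), (batch size, T), and (batch size, T, 1) shapes given in the text.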

Conclusions and Discussions
The present study investigates short-term metro passenger flow prediction with advanced deep learning technology. Three months of trip records are obtained from metro line 2 of the Nanjing metro system. The passenger flow patterns vary greatly across different types of metro stations, which makes short-term passenger flow prediction difficult. In this study, a hybrid spatiotemporal deep learning neural network (HSTDL-net) is proposed for predicting both inbound and outbound passenger flows at three types of stations for every 10 minutes. The developed HSTDL-net incorporates the convolution operation between LSTM cells, which can capture both spatial and temporal dependencies among passenger flow variables.
In general, the proposed HSTDL-net achieves better prediction performance on suburban stations than on urban stations. Among the metro station types, the model of transfer stations exhibits the best prediction accuracy and the model of terminal stations performs the worst. Moreover, a comparative analysis is conducted to compare the prediction performance of the proposed HSTDL-net with that of other benchmark models. The results suggest that the proposed HSTDL-net can more effectively discover both spatial and temporal hidden correlations between stations for short-term metro passenger flow prediction.
The results of short-term passenger flow prediction could provide insightful suggestions for decision makers of metro systems. More specifically, accurate prediction results can help metro system authorities dynamically modify operation plans according to the fluctuation of passenger flow, such as adjusting the headway and train dispatching schedule to ensure the service quality of the entire metro system. In addition, with the predicted passenger flow over the next several time steps, passenger congestion can be identified in advance so that timely crowd regulation and emergency response plans can be made. For example, the metro system management can assign extra trains and add more volunteers to relieve passenger congestion and improve the service level of the metro system.
This study is a first step towards exploring short-term passenger flow prediction with advanced deep learning technology. Our future work will focus on incorporating other data sources into short-term metro passenger flow prediction. For example, real-time weather information and the land use pattern surrounding each metro station may be closely related to the passenger flow pattern. Moreover, the proposed HSTDL-net has achieved strong performance on a single metro line, but its performance on an entire metro network with multiple metro lines should be further tested. The authors recommend that future studies focus on these issues.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.