Multi-featured spatial-temporal and dynamic multi-graph convolutional network for metro passenger flow prediction

Metro passenger flow prediction is an essential part of crowd flow forecasting and intelligent transportation management systems. However, two challenges still need to be addressed to achieve more accurate predictions: (1) accounting for featural dependence instead of considering only the temporal connection and spatial relations; and (2) utilising graph structures to address the non-Euclidean relationships of spatial and featural dependence. To address these challenges, we developed a novel model called the multi-featured spatial-temporal (MFST) and dynamic multi-graph convolutional network (DMGCN) model. Temporal connections are learned from both the local and global information in a time-series sequence using the combination of a time-trend feature mapping block and a gated recurrent unit block. Spatial relation and featural dependence are separately captured by two DMGCN blocks. Each DMGCN block encodes various relationships by constructing multiple graphs consisting of predefined and non-defined topologies. Evaluations of the MFST tensor and the DMGCN on a real-world Beijing subway dataset indicate that the prediction performance of the proposed model is superior to that of the existing baselines. The proposed model thus contributes significantly to the improvement of public safety by providing early warnings of large passenger flows and enabling the smart scheduling of resources.


Introduction
Owing to the increasing demand for smart cities, intelligent urban rail transit has always been an important area that needs to be highlighted. To date, the metro has been one of the most popular means of transport. However, it is susceptible to security issues caused by massive crowds. Improving the effectiveness and quality of passenger flow prediction significantly aids in avoiding public safety accidents caused by congestion. With the advances in artificial intelligence and the application of deep learning in multiple domains (Chang et al., 2021; Chen et al., 2021; LeCun et al., 2015; Samek et al., 2021), several researchers have focused on implementing effective and accurate prediction of metro passenger flow (Wang et al., 2021; Zhang et al., 2021). However, despite years of research achievements, dynamically capturing the various dependencies in the dimensions of time, space, and features remains a challenge for accurate metro passenger flow prediction.
Featural dependence is a vital consideration for accurate flow predictions, in addition to the temporal connections and spatial relations. As shown in Figure 1, the metro passenger flow contains time, space, and feature variables. Figure 2 illustrates typical passengers who travel from the urban fringe (inflow) to the centre of the city (outflow) during the morning rush hour. The four flows convert into one another as both space and time change.
Based on this elaboration, we simultaneously consider abundant relationships, including the temporal connection, spatial relation, and featural dependence, implying that the raw input is deemed a multi-featured spatial-temporal (MFST) tensor. In previous studies, the output was predicted by considering only the inflow and outflow as inputs (Bai et al., 2021; Gong et al., 2022; Han et al., 2019). To achieve the same forecasting goal, our proposed method considers four types of flows. Further, the superior performance resulting from considering MFST tensors is demonstrated on a real-world dataset.
The following two examples explain the challenge of capturing the various relationships with graph structures. As shown in Figure 3, the information regarding the Dongdan station, represented by Vertex A, is learned by considering its connected stations, represented by Vertices B, C, D, and E. In the first case, Vertex A is linked with B, C, D, and E, implying that the information updated in A is influenced only by Vertices B, C, D, and E. However, Vertex A also receives shifted values from Vertex Z when the sampling time is long, which is ignored by conventional GCN approaches. In the second case, suppose that Vertex B possesses a flow volume different from that of Vertex E; nevertheless, when computing the information update of Vertex A, both Vertices B and E share the same influential factor.
To address the aforementioned two challenges, we propose a novel model known as MFST and DMGCN (MFST-DMGCN) to solve accurate prediction tasks, where two DMGCN blocks, a featural and a spatial (F- and S-DMGCN) block, are specifically designed to capture the featural dependence and spatial relation, respectively. A time-trend feature mapping (TF-M) block and a GRU block are proposed to obtain the local trend information and global contexts in the time-series sequence. The outputs of MFST-DMGCN can provide data fundamentals for intelligent urban rail management, such as avoiding crowding accidents and enhancing metro operation efficiency.
The contributions of this work are as follows:
• We view the metro passenger flow as an MFST tensor to consider multiple relationships regarding not only the temporal connection and spatial relations but also the featural dependence of the whole metro system.
• TF-M and gated recurrent unit (GRU) blocks are employed to consider the temporal local context and global information, whereas the DMGCN block is designed to dynamically obtain the non-Euclidean spatial relation using multiple graphs, including a predefined and a non-defined topology aided by the self-attention mechanism.
• To better express the featural dependence so that the internal interactions among features are clearer and more reasonable, four passenger flows are acquired and employed in this research: inflow, outflow, up-flow, and down-flow.
• The results of experiments conducted using raw data from the Beijing subway verify that the proposed model achieves better prediction performance than the existing baseline models.

Metro passenger flow prediction
Urban metro development has gained substantial research attention recently (Hou et al., 2019; Li et al., 2020; Liu et al., 2021). The increased demand for accurate forecasting in practice has led to increased interest in metro passenger flow prediction (Gong et al., 2022; Zhang et al., 2020). The related work is reviewed in the following two sections.

Time-series analysis and machine learning methods
In recent decades, several time-series analysis methods have been proposed for passenger flow prediction, such as the autoregressive integrated moving average (ARIMA) model (Yinna et al., 2019) and the vector autoregression (VAR) model. However, these statistical methods perform poorly because practical metro passenger flows are massive and nonlinear. Moreover, a group of machine learning approaches has been proposed, such as the support vector machine (SVM) model and the genetic particle swarm optimisation SVM (GPSO-SVM) model (Mei et al., 2017). These approaches can model nonlinear dependencies but are time-consuming owing to their heavy dependence on high-quality feature engineering. With the advances in artificial intelligence and the application of deep learning in multiple domains, such as speech recognition (Hinton et al., 2012), semantic image segmentation (Jiang et al., 2021), and sentiment classification (Cao et al., 2021), new insights have been developed to address the crowd flow prediction problem (Han et al., 2019; Liu et al., 2019; Wang, Huang, et al., 2020). Metro passenger flow forecasting was first treated as a time-series prediction. The long short-term memory (LSTM) network and the GRU, an improvement on the LSTM, have been exploited for prediction tasks (Hema & Kumar, 2021; Ji & Hou, 2017; Ma et al., 2015). However, these typical RNN-based approaches consider only temporal dependencies and fail to consider spatial relationships.

Deep learning methods in spatial-temporal prediction
With the performance improvements resulting from viewing the input data as a spatiotemporal sequence forecasting problem (Shi et al., 2015), a set of spatial-temporal deep learning approaches has been designed. Liu et al. (2019) proposed an end-to-end LSTM-based architecture. Bai et al. (2021) designed the A3T-GCN model to learn the short-time trend using GRUs and the spatial dependence based on the GCN. Moreover, one-dimensional convolutional neural networks (1D-CNNs) have also been exploited to gather temporal connections in spatial-temporal prediction. Yu et al. (2017) employed convolutional structures to capture temporal dynamic behaviours. Guo et al. (2019) utilised a CNN to exploit temporal dependencies from nearby times. However, restricted by the kernel size of the convolution, these approaches are weak in capturing long-term temporal sequences.

Utilising graph structures in non-Euclidean relationships
The GCN excels at aggregating non-Euclidean information and is widely utilised in several tasks, such as forecasting wind speed (Khodayar & Wang, 2019), exploring epidemic outbreaks (La Gatta et al., 2021), and addressing recognition assignments (Wang et al., 2022). With the proposed spatial graph convolution methods (Zhao et al., 2020), GCNs are being applied in passenger flow prediction to capture spatial correlations. However, because the adjacency matrix in a GCN is constant, it is difficult for GCN methods to integrate rich topological information (Hu et al., 2022; Li et al., 2018; Wang, Zhu, et al., 2020). Multi-graph methods can capture pairwise non-Euclidean correlations. For instance, the spatiotemporal MGCN (ST-MGCN) is designed to encode complicated spatiotemporal dependencies into multiple graphs (Geng et al., 2019). However, the adjacency matrices in the multi-graph remain constant, which limits the results.
The attention mechanism can produce the influence intensity in graph structures based on the inputs (Qian et al., 2020). The attention-based spatial-temporal GCN (AST-GCN) model contains the spatial-temporal attention mechanism to capture the dynamic spatial-temporal correlations (Guo et al., 2019). The Graph WaveNet presents an adaptive dependency matrix to learn spatial dependency (Wu et al., 2019).
Motivated by these two lines of extension, we propose a DMGCN that considers richer non-Euclidean information through multiple graphs and learns the influence intensity among graph structures via the self-attention mechanism.

Metro spatial network
A metro spatial network is defined as a graph G_s = (V_s, A_s, W_s) describing the connections among all metro stations, where V_s denotes a set of vertices with |V_s| = N in the metro space (N denotes the number of stations), and A_s, W_s ∈ R^{N×N} denote two adjacency matrices. A_s is filled with values of "0" and "1" ("1" indicates an edge bridging two vertices, whereas "0" indicates no edge between the two vertices), and the value of W_ij in matrix W_s ranges from "0" to "1" and signifies the strength of the edge connecting vertices i and j.
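As an illustrative sketch (not the paper's construction), the two matrices of a toy metro spatial network could be built as follows; the station count, edge list, and inverse-travel-time weighting are all hypothetical choices for demonstration:

```python
import numpy as np

# Toy metro spatial graph with N = 4 stations on one line (hypothetical).
N = 4
edges = [(0, 1), (1, 2), (2, 3)]           # physical track connections

# A_s: binary adjacency; "1" marks an edge bridging two stations.
A_s = np.zeros((N, N))
for i, j in edges:
    A_s[i, j] = A_s[j, i] = 1.0            # undirected metro line

# W_s: edge strengths in [0, 1]. Here, strength is derived from inverse
# travel time -- an illustrative weighting, not the paper's definition.
travel_min = {(0, 1): 2.0, (1, 2): 4.0, (2, 3): 3.0}
W_s = np.zeros((N, N))
for (i, j), t in travel_min.items():
    W_s[i, j] = W_s[j, i] = 1.0 / t        # shorter ride -> stronger edge
W_s = W_s / W_s.max()                      # rescale into [0, 1]
```

The metro feature network's A_f and W_f (4 × 4 over the four flow types) would be built the same way, only over feature vertices instead of stations.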

Metro feature network
A metro feature network is defined as a graph G_f = (V_f, A_f, W_f) describing the relationships among all the passenger flows, where V_f denotes a set of |V_f| = F vertices in the metro features (F denotes the number of features in the metro flow), and A_f, W_f ∈ R^{F×F} are two adjacency matrices. A_f is filled with values of "0" and "1" ("1" indicates an edge bridging two vertices, and "0" indicates no edge between two vertices), and the value of W_ij in matrix W_f ranges from "0" to "1" and signifies the strength of the edge connecting vertices i and j.

MFST tensor
The MFST tensor contains three variables: time, space, and feature. These variables can be deemed multiple observations of all stations at each timestamp; the observation at timestamp t is denoted X_t ∈ R^{N×F}, and a historical sequence is written X = (X_1, X_2, ..., X_t), where N represents the number of stations and F is the number of features.
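A minimal sketch of how an MFST tensor might be laid out in code, using synthetic counts (the Poisson rate and the time-major array layout are illustrative assumptions, not from the paper):

```python
import numpy as np

# Hypothetical layout: t timestamps x N stations x F features, where the
# F = 4 features are inflow, outflow, up-flow, and down-flow.
t, N, F = 12, 301, 4
rng = np.random.default_rng(0)
X = rng.poisson(lam=50, size=(t, N, F)).astype(float)  # synthetic flow counts

X_t = X[-1]        # observation at the latest timestamp, an element of R^{N x F}
```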

Problem statement
MFST data are multidimensional tensors that contain time, space, and feature variables; moreover, both the spatial relation and the featural dependence are updated over time. Therefore, the metro passenger flow forecast is a time-series prediction problem.
Given a historical sequence of MFST tensors X = (X_{t−T_h+1}, ..., X_t), the metro flow forecast task aims to predict a feature of all nodes in the subsequent T_p timestamps, denoted as Y = (Y_{t+1}, Y_{t+2}, ..., Y_{t+T_p}) ∈ R^{N×1×T_p}. These values are produced by the forecasting model F with parameter Θ and the graph structures G_s and G_f mapped from the data:

Y = F(X; Θ, G_s, G_f)    (1)

Overview
Our proposed MFST-DMGCN model comprises TF-M, F-DMGCN, S-DMGCN, and GRU blocks. It addresses featural dependence, temporal connection, and spatial relation simultaneously. The highlight of the proposed model is the design of the DMGCN block, which captures the various non-Euclidean relationships in the MFST tensor. An overview of the MFST-DMGCN framework is shown in Figure 4; the detailed process comprises four steps. First, the raw tensors X ∈ R^{N×T_h×F} are divided into several groups according to the types of features. These groups are respectively put into different TF-M blocks to capture the local trend information in a time-series sequence. The results of the different TF-M blocks are stacked to produce high-dimensional representative hidden states H ∈ R^{N×D_M×T_h×F}, where D_M is the number of output channels in the TF-M blocks. Second, the hidden states are fed into the F-DMGCN block to dynamically capture the various feature relationships. Third, the result of the F-DMGCN block, O_f ∈ R^{N×D_M×T_h}, is fed into the S-DMGCN block to dynamically learn the relationships in space. Finally, the output of the S-DMGCN block, O_s ∈ R^{N×D_M×T_h}, is fed into the GRU block to learn the temporal dependencies, and a final prediction result O ∈ R^{N×T_p} is obtained. To ensure the effectiveness of this model, residual connections and layer normalisation are added in all blocks.
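The four-step pipeline can be summarised as a shape-only walk-through; the placeholder operations below merely stand in for the trained blocks, so this fixes the tensor dimensions rather than implementing the model:

```python
import numpy as np

# stations, history length, features, hidden channels, forecast horizon
N, T_h, F, D_M, T_p = 301, 12, 4, 16, 3

X = np.zeros((N, T_h, F))        # step 0: raw MFST input
H = np.zeros((N, D_M, T_h, F))   # step 1: stacked TF-M outputs (placeholder)
O_f = H.mean(axis=3)             # step 2: F-DMGCN collapses features -> (N, D_M, T_h)
O_s = O_f                        # step 3: S-DMGCN preserves the shape
O = np.zeros((N, T_p))           # step 4: GRU head emits the T_p-step forecast
```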

TF-M block
To consider the local trend in the time series and map more information, we designed the TF-M block as the first step of the model. The following example demonstrates the difference between a traditional mapping approach and our TF-M approach. Figure 5(b) shows a continuous-time sequence in the raw data; V1 and V2 are the same values at times C and F but present different time characteristics, i.e. a morning peak and an evening peak. Figure 5(a) shows the different mapping results of the aforementioned two approaches.
The traditional mapping approach does not consider the characteristics of times C and F. In contrast, the TF-M approach learns these characteristics by considering the influence of A and B on C and the impact of D and E on F; thus, the mapping result presents a more realistic temporal situation.
The details of the TF-M block are presented in Figure 6. The raw input data are first divided into four groups based on the types of features and projected by four parallel TF-M blocks; the results of each block are stacked to create the output, as shown in Figure 6(a). Figure 6(b) shows the entire process of each TF-M block, which consists of two 1D-CNN layers connected by a ReLU(·) activation function. The input of each block, X ∈ R^{C_i×T_h}, can be treated as a time sequence with C_i channels; a sliding window is employed to aggregate the K_t neighbouring moments in the time sequence. The convolution kernels in the two layers are Θ_1 ∈ R^{K_t×C_i×C_h} and Θ_2 ∈ R^{K_t×C_h×C_o}, respectively. A padding operation makes the input X ∈ R^{C_i×T_h}, the hidden state, and the output Y ∈ R^{C_o×T_h} (C_o ≥ C_i) share the same time-sequence length T_h.
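Under the stated shapes, one TF-M block (two same-padded 1D convolutions joined by ReLU) can be sketched in plain NumPy; the channel sizes and random kernels here are illustrative, not trained parameters:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1d_same(x, kernel):
    """x: (C_in, T); kernel: (K_t, C_in, C_out). 'Same' padding keeps length T."""
    K_t, C_in, C_out = kernel.shape
    pad = K_t // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    T = x.shape[1]
    out = np.zeros((C_out, T))
    for t in range(T):
        window = xp[:, t:t + K_t]          # (C_in, K_t) sliding window
        out[:, t] = np.einsum('ck,kco->o', window, kernel)
    return out

def tf_m_block(x, theta1, theta2):
    """Two 1D-CNN layers connected by ReLU, as in the TF-M block."""
    return conv1d_same(relu(conv1d_same(x, theta1)), theta2)

rng = np.random.default_rng(1)
C_i, C_h, C_o, K_t, T_h = 1, 8, 16, 3, 12
x = rng.standard_normal((C_i, T_h))
theta1 = rng.standard_normal((K_t, C_i, C_h)) * 0.1
theta2 = rng.standard_normal((K_t, C_h, C_o)) * 0.1
y = tf_m_block(x, theta1, theta2)          # (C_o, T_h): same time length
```

Note how the padding preserves T_h while the channel count grows from C_i to C_o, matching the shapes described above.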

DMGCN block
This section illustrates the DMGCN block, which is the mainstay of the F-DMGCN and S-DMGCN blocks used in the second and third steps of the MFST-DMGCN model. The DMGCN block can be expressed in four aspects: the self-attention mechanism, the dynamic predefined GCN (DP-GCN) layer, the dynamic non-defined GCN (DN-GCN) layer, and the attention fusion.
The attention mechanism excels at picking the relatively influential parameters of the current state by learning the weights between the queries and all keys (Vaswani et al., 2017). Self-attention is a particular form of the attention mechanism in which the queries, keys, and values (Q, K, V) share the same dimension. The factor 1/√D_M is added to scale the dot product in the self-attention mechanism to effectively construct the global receptive field, as shown in Equation (2) (Vaswani et al., 2017).
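The scaled dot-product self-attention of Equation (2) can be sketched as follows; this is a plain NumPy rendering of Vaswani et al.'s formulation, with hypothetical random projection matrices standing in for learned ones:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    """Scaled dot-product self-attention: Q, K, V share the dimension D_M."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    D_M = Q.shape[-1]
    S = softmax(Q @ K.T / np.sqrt(D_M))       # (V, V) attention score matrix
    return S @ V, S

rng = np.random.default_rng(2)
n_vertices, D_M = 4, 8
H = rng.standard_normal((n_vertices, D_M))
Wq, Wk, Wv = (rng.standard_normal((D_M, D_M)) for _ in range(3))
out, S = self_attention(H, Wq, Wk, Wv)
# each row of S is a probability distribution over the vertices
```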
The DP-GCN layer is structured by a predefined graph topology, which implies that the connection between two related vertices is fixed. Moreover, to capture the dynamic relative influence over these unchanged connections, a self-attention matrix S_dp is employed, as shown in Equation (3), which denotes the dynamic relationship strength between each pair of vertices.
Then, an element-wise dot-product operation is used to bind the self-attention score matrix with the adjacency matrix A. In different networks, V has different meanings: in the metro feature network, V refers to the features and is equal to F in the input X_dpf ∈ R^{T_h×N×F×C_o}, whereas in the metro spatial network, V refers to the stations and is equal to N in the input X_dpn ∈ R^{T_h×N×C_o}. This binding yields the definition of the DP-GCN.

To capture the various relationships in both the spatial connection and the featural dependency, a DN-GCN layer is designed to adaptively learn the connection between each pair of vertices in the entire network. The same input H^{(m−1)} is used in this layer to calculate a weight matrix based on the self-attention mechanism. The weight matrix W_np is then utilised as an adjacency matrix to dynamically capture the connections between vertices in the entire network, which defines the DN-GCN.

Finally, attention fusion is applied to dynamically integrate the outputs of the DP-GCN and DN-GCN layers to produce the output of the DMGCN block:

Out_dmgcn = W_p Out_p + W_np Out_np,

where W_p and W_np separately denote the weights learned from the corresponding two outputs.
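A simplified single-step sketch of the two branches and their fusion is given below. Masking the attention scores with the predefined adjacency stands in for the element-wise binding of the DP-GCN, the unmasked scores stand in for the DN-GCN's learned topology, and a fixed scalar alpha replaces the learned fusion weights W_p and W_np, so this is an approximation rather than the paper's exact equations:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dmgcn_step(H, A, Wq, Wk, Wg, alpha=0.5):
    """One illustrative DMGCN propagation over vertex states H (V, D)."""
    S = softmax((H @ Wq) @ (H @ Wk).T / np.sqrt(H.shape[1]))  # attention scores
    out_p  = (S * A) @ H @ Wg   # DP-GCN branch: scores masked by adjacency A
    out_np = S @ H @ Wg         # DN-GCN branch: fully learned connections
    return alpha * out_p + (1.0 - alpha) * out_np  # fixed-weight stand-in fusion

rng = np.random.default_rng(3)
V, D = 4, 8
H = rng.standard_normal((V, D))
A = np.ones((V, V))             # toy fully connected predefined graph
Wq, Wk, Wg = (rng.standard_normal((D, D)) for _ in range(3))
out = dmgcn_step(H, A, Wq, Wk, Wg)
```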

GRU block
The GRU is an improvement on the LSTM network that is simpler to calculate and implement. It effectively learns the temporal dependence by memorising the historical information of the entire time series. The GRU block is proposed as the fourth step of MFST-DMGCN to extract the global temporal features from the output of the previous block. The concrete functions of this block are defined in Equations (8)-(11). The complete structure and details of this block are presented in Figure 7.
where h_{t−1} denotes the output at time t − 1, u_t and r_t represent the update and reset gates at time t, respectively, and the output at time t is h_t. The weights and biases in the operations are represented by W and b, respectively.
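Equations (8)-(11) correspond to the standard GRU update; a compact NumPy rendering (with hypothetical weight shapes and one common gating convention) is:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, h_prev, W, b):
    """One standard GRU step: update gate u, reset gate r, candidate c, new h."""
    xh = np.concatenate([x_t, h_prev])
    u = sigmoid(W['u'] @ xh + b['u'])                          # update gate
    r = sigmoid(W['r'] @ xh + b['r'])                          # reset gate
    c = np.tanh(W['c'] @ np.concatenate([x_t, r * h_prev]) + b['c'])
    return (1.0 - u) * h_prev + u * c                          # blended state

rng = np.random.default_rng(4)
D_in, D_h = 8, 16
W = {k: rng.standard_normal((D_h, D_in + D_h)) * 0.1 for k in 'urc'}
b = {k: np.zeros(D_h) for k in 'urc'}
h = np.zeros(D_h)
for _ in range(12):                 # roll the cell over T_h = 12 timesteps
    h = gru_cell(rng.standard_normal(D_in), h, W, b)
```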

Dataset
The dataset comprises information about the passenger flows in the Beijing subway, recorded at 30-min intervals at 301 subway stations. Four types of flows are measured in the data: inflow, outflow, up-flow, and down-flow, which signify the number of passengers entering the station, leaving the station, travelling in one direction, and travelling in the opposite direction, respectively. Considering the impact of the Covid-19 outbreak in 2020, the period of these flows was from July to December 2019. We selected the daily data from 6:00 am to 11:00 pm because the passenger flow is zero between 11:00 pm and 6:00 am for most stations. The weights on the input timesteps are learned and determined by the model itself; therefore, the impacts of weekends on weekdays are considered in the model. The data from July to October 2019 (122 days) served as the training set and November 2019 (31 days) as the validation set, whereas the last 30 days served as the test set. Additionally, the data were transformed using Z-score normalisation to eliminate the influence of differing magnitudes. Moreover, considering the correlations in the metro network, a 301 × 301 spatial adjacency matrix and a 4 × 4 feature adjacency matrix were constructed separately to capture the spatial connection and featural dependency.
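The split and normalisation protocol can be sketched as follows; the synthetic flow values and the 34 half-hour steps per day (6:00 am to 11:00 pm) are assumptions for illustration, and fitting the Z-score statistics on the training set only is a standard practice rather than a detail stated in the paper:

```python
import numpy as np

# Hypothetical protocol mirroring the paper: 122 train days, 31 validation
# days, 30 test days; 34 half-hour steps per day (6:00 am - 11:00 pm).
steps_per_day = 34
days = 122 + 31 + 30
rng = np.random.default_rng(5)
flow = rng.poisson(lam=40, size=(days * steps_per_day, 301, 4)).astype(float)

train = flow[:122 * steps_per_day]
val   = flow[122 * steps_per_day:(122 + 31) * steps_per_day]
test  = flow[(122 + 31) * steps_per_day:]

# Z-score normalisation fitted on the training set, then reused everywhere.
mu, sigma = train.mean(), train.std()
norm = lambda x: (x - mu) / sigma
train_n, val_n, test_n = map(norm, (train, val, test))
```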

Settings
To avoid the evaluation results being influenced by the magnitude of the data, the mean absolute error (MAE), root-mean-square error (RMSE), symmetric mean absolute percentage error (SMAPE), and weighted mean absolute percentage error (WMAPE) were adopted as the evaluation metrics in all experiments. To make an equitable comparison in these experiments, the historical data over the past 6 h (12 steps) were utilised as the same inputs to predict the passenger inflow at all stations for the next 30, 60, and 90 min, respectively.
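The four metrics can be written compactly as below; note that SMAPE has several variants in the literature, and the version here (with a small epsilon guarding against zero denominators) is one common choice rather than necessarily the paper's exact definition:

```python
import numpy as np

def mae(y, p):   return np.mean(np.abs(y - p))
def rmse(y, p):  return np.sqrt(np.mean((y - p) ** 2))
def smape(y, p): return np.mean(2 * np.abs(y - p) / (np.abs(y) + np.abs(p) + 1e-8))
def wmape(y, p): return np.sum(np.abs(y - p)) / (np.sum(np.abs(y)) + 1e-8)

# Tiny worked example: each prediction is off by 10 passengers.
y = np.array([100.0, 200.0, 300.0])   # ground-truth inflow
p = np.array([110.0, 190.0, 310.0])   # predicted inflow
# mae -> 10.0, rmse -> 10.0, wmape -> 30 / 600 = 0.05
```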

Baseline models
To assess the performance of our model, seven popular models widely used in passenger flow forecasting problems were selected as baseline models:
• MLP (Karlaftis & Vlahogianni, 2011): multi-layer perceptron, a widely utilised ANN model.
• LSTM (Ma et al., 2015): long short-term memory network, a special RNN model.
• GRU (Fu et al., 2016): gated recurrent unit network, a special RNN model.
• TGCN (Zhao et al., 2020): temporal graph convolutional network, a combination of the graph convolutional network and the gated recurrent unit.
• STGCN (Li et al., 2018): a spatial-temporal graph convolution model based on the spatial method.
• ASTGCN (Guo et al., 2019): spatial-temporal convolution and an attention mechanism are employed to dynamically capture the spatial patterns and the temporal features.
• Graph WaveNet (Wu et al., 2019): a combination of graph and temporal convolutions that adaptively captures spatial-temporal dependencies.

Results analysis
To present the prediction results, three time slices with different prediction intervals, specifically 30, 60, and 90 min, are selected as examples. The ground truth and predicted passenger inflow are plotted with red and green lines, respectively, in Figure 8.
To illustrate the details more clearly, the results of thirty stations, from No. 135 to 165, were selected, and the errors between the ground truth and the prediction are drawn in blue in the enlarged graphs of Figure 8. The results show that our proposed model achieved accurate predictions for the whole subway system during the next 30, 60, and 90 min. The temporal connections are well captured by the proposed MFST-DMGCN model.
To show the prediction accuracy, four stations with different spatial characteristics were selected as examples, with the ground truth and prediction from December 2nd to 8th, 2019, plotted using red and green lines, respectively, in Figure 9: Guomao station (Figure 9(a)), Nanlishi Lu station (Figure 9(b)), Pingguo Yuan station (Figure 9(c)), and Beijing Railway station (Figure 9(d)). The prediction accuracies at the four stations are nearly identical, indicating that the spatial relationships are well learned by the MFST-DMGCN model.

Comparisons with baseline models using different input data features
The setting that uses inflow as the only input to predict the value of inflow was denoted "simple input", while the setting that uses the four types of flows, namely the inflow, outflow, up-flow, and down-flow, as input to complete the same prediction task was denoted "multiple inputs". Tables 1 and 2 separately summarise the average performances of these two settings.
We compare the improvements in MAE, RMSE, and SMAPE across the different models when utilising the simple input versus the multiple inputs in Figure 10(g)-(i), together with Tables 1 and 2. The results of utilising multiple inputs are mostly superior to those of the simple input, which demonstrates the advantage of considering featural dependence in metro passenger forecasting.
The results of the baseline models MLP, LSTM, and GRU are less satisfactory than those of the others because they lack spatial relation and featural dependence considerations. The SMAPE and WMAPE results of STGCN are better than those of the MLP, LSTM, and GRU, particularly in the 60- and 90-min durations, owing to its consideration of the spatial connections; however, they remain less satisfactory than those of our proposed model. Finally, the SMAPE and WMAPE results of our model are slightly lower (better) than those of ASTGCN and Graph WaveNet because the latter fail to consider the effect of featural dependency. In general, the proposed MFST-DMGCN model achieves the best prediction performance, especially in the 60- and 90-min durations.

Ablation experiments
Four variant versions of MFST-DMGCN were designed with the same settings as MFST-DMGCN. The variants and their corresponding differences are as follows: MFST-DMGCN (no-TF-M), which removes the TF-M block; MFST-DMGCN (no-F-DMGCN), which removes the F-DMGCN block; MFST-DMGCN (no-S-DMGCN), which removes the S-DMGCN block; and MFST-DMGCN (no-GRU), which replaces the GRU block with two fully connected layers. The results are shown in Table 3 and Figure 11. The MFST-DMGCN (no-TF-M) model performed considerably worse than the MFST-DMGCN model, proving that the TF-M block excels at gathering local information along the temporal dimension. Moreover, MFST-DMGCN separately improves on MFST-DMGCN (no-F-DMGCN) and MFST-DMGCN (no-S-DMGCN), which denotes the necessity of these two components. Comparing the result of MFST-DMGCN (no-S-DMGCN) with that of MFST-DMGCN (no-F-DMGCN) shows that the effect of employing the DMGCN in the spatial domain is superior to using it in the feature domain. Additionally, MFST-DMGCN achieves a better performance than MFST-DMGCN (no-GRU), indicating that the GRU block performs more effectively than two fully connected layers, especially in the next 60- and 90-min predictions.

Conclusion
In this study, we obtained a deep comprehension of metro passenger flow forecasting tasks by deeming the metro passenger flow an MFST tensor to consider the temporal connection, spatial relation, and featural dependence. Subsequently, we proposed a novel model, MFST-DMGCN, that can effectively and dynamically capture the various relationships in the MFST tensor. The output of our model predicts the arriving crowds at every station. It provides data support in three aspects: station operation decisions, train schedule timetable adjustment, and early warning of heavy congestion. Through the spatial and temporal relations among stations, forewarnings of large passenger flows can be issued not only to the congested station but also to the nearby stations. Meanwhile, this model could promote public safety by providing scientific references for unscheduled massive crowds, congestion transmission prediction, emergency passenger crowd limitation, and metro timetable adjustment.
The MFST-DMGCN model comprises a TF-M block that obtains the local trend information in a time-series sequence and provides sufficient expressive power. Two DMGCN blocks, the F- and S-DMGCN blocks, consider the predefined and non-defined topologies to dynamically capture the connections in space and feature based on a self-attention mechanism. Comparisons of the MFST-DMGCN model with existing baseline models on a real-world Beijing subway dataset were conducted. The results of comparisons utilising different types of input data and of ablation experiments verify the feasibility and superiority of the proposed model.

For the time dimension, a TF-M block and a GRU block are incorporated to handle the prediction task as a time-series forecast. The predictions of multiple future periods (30, 60, and 90 min) can provide forewarning precautions to public safety systems with buffer time. For the space dimension, the prediction of metro passenger flow across all subway stations offers a clearer understanding of the spatial distribution of passengers, so that further planning of metro construction and resource allocation can be more reasonable for future urban development. For the feature dimension, this study innovatively added the up-flow and down-flow features to reveal more detailed information regarding passengers' travel behaviour. Indeed, busy metro stations are more likely to experience safety accidents.

In future work, considering more external features, such as the weather, pandemics, major holidays, accidents, or machine breakdowns, could enable the model to make more accurate predictions. Replacing the GRU block with other approaches in the fourth step of the proposed model may also produce more accurate predictions. Further, some potential extensions could be trialled to handle other traffic domains, such as highway traffic forecasting or bike-sharing demand prediction.

Disclosure statement
No potential conflict of interest was reported by the author(s).