Rail Transit Prediction Based on Multi-View Graph Attention Networks



Introduction
As a core function of intelligent transportation systems, traffic forecasting has practical significance for intelligent command and dispatch, traffic planning and layout, and public travel convenience. The prediction of passenger flow in and out of rail transit stations is one of the research hotspots in the field of smart transportation. An accurate passenger flow prediction method benefits the transportation system in route scheduling, road network design, crowd evacuation, and other specific applications. Most previous studies have focused on methods based on mathematical modeling and machine learning. In rail transit, however, the unique topological structure of the network and the travel patterns of passengers make it difficult to obtain efficient and accurate predictions by simply applying traditional methods, and related research is relatively limited.
In recent years, graph convolutional neural networks have achieved strong performance in traffic prediction by virtue of their ability to process non-Euclidean data. Networks are ubiquitous in the real world, such as transportation networks, social networks, and recommendation networks. By modeling a network as a graph, subsequent prediction tasks can be performed.
The graph-based non-Euclidean topology not only describes the connection relationships between stations but also constrains the flow path of data. Nongraph methods, by contrast, can only make predictions for each station separately and average the prediction results, and cannot make full use of the topology of rail transit.
However, node relationships in the real world are more complex and contain many types of interrelated relationships. A view represents one such relationship, so node relationship information is lost to some extent if only a single view is used for representation [1]. Multiple views can model different types of relationships more accurately, ensuring that the model retains more comprehensive node information, which in turn enables more accurate node-level predictions. In rail transit, from a structural perspective, different types of lines and stations can be assigned to different view features. From the perspective of traffic flow characteristics, the patterns of passenger travel in different time spans can be viewed as different spatial-temporal features [2]. When the model contains multiple node relationships at the same time, the key issue becomes how to integrate these relationships with optimal weights to achieve more accurate prediction.
Since the same node has a different importance in different views, the relationships between nodes in different views should be given different weights. Identical weights would negatively affect the final prediction and weaken the information provided by multiple views. Therefore, we design a multi-layer attention mechanism to optimize the weights of different views. In addition, during the training of a graph neural network, the over-smoothing problem significantly affects the training effect as the number of network layers increases. That is, the hidden layer representations of the nodes converge to the same value during training, which eventually leads to poor training results.
In response to the above problems, we propose a traffic prediction model based on a multi-view graph attention network (MV-GAT). Its main contributions can be summarized as follows:
(1) An end-to-end rail passenger flow prediction model is proposed. The proposed model achieves fine-grained multi-view modeling of rail transit characteristics at the input and node-level prediction at the output.
(2) Through the multi-layer attention module, the proposed model can assign different weights to different nodes and relationships within multiple views, thereby learning the optimal regression of nodes.
(3) The autoencoder module transfers the latent information captured by each of its layers to the corresponding graph convolution layer, ensuring the validity of the structural information of each layer in the network and further improving the effect of node prediction.
The model is evaluated through experiments on the Beijing rail transit historical dataset, and its superiority is verified by comparison with existing models. Furthermore, multi-view and multi-layer attention have good interpretability, as shown in ablation experiments.

Related Work
The research content of this paper mainly involves graph convolutional neural networks and the graph attention mechanism.

Graph Convolutional Networks.
Graph convolutional networks (GCNs) are currently used in many domains such as traffic prediction [3], recommender systems [4,5], and traffic situation analysis [6]. On graphs, their tasks include graph classification [7], node classification [8], link prediction [9,10], and graph pooling [11]. GCNs have different kernels that learn node embeddings to be applied to downstream tasks. For example, DeepWalk [12] and node2vec [13] are both random walk-based methods. The SDNE model [14] uses autoencoders to maintain the first- and second-order proximity of networks, using highly nonlinear functions to obtain embeddings. Existing traffic flow forecasting techniques include traditional mathematical modeling methods, such as ARIMA [15], as well as deep learning methods. Among them, deep learning methods are subdivided into nongraph-based methods, such as LSTM [16], and graph-based methods, such as GCN models. Traditional mathematical modeling methods and nongraph-based methods do not consider the topology of the graph and can only make separate predictions for individual sites. Graph-based deep learning methods can achieve node-level prediction, but the current mainstream methods are mainly single view [17].
Single-view graph neural networks contain only one relationship between nodes [18]. Although a single view has advantages, such as being easy to understand and making neural network models easy to design, it struggles to accurately capture the complex relationships between nodes, which play a crucial role in the effectiveness of information transfer and problem solving [19]. It has been pointed out that graph data carry similarity information between nodes, and preserving this similarity information in the hidden layers of graph convolutional neural networks has been proposed [20]. However, these methods rarely exploit multi-view prediction in end-to-end network models.

Graph Attention Mechanism.
The attention mechanism was first proposed for natural language processing and is now widely used for many sequence-related tasks. The advantage of the attention mechanism is that it can amplify the impact of important parts of a sequence, and its introduction also facilitates the use of graph neural networks. Because graph convolutional networks rely on the eigenvalues of the Laplacian matrix, it is difficult to extract convolutional operations from the overall static graph structure. In an attention network, the output at a given moment depends on the attention it allocates across multiple inputs, i.e., the learned weight assigned to each part of the input, with a larger weight implying a greater influence of that input on the output at that particular moment.
As in the attention mechanism of the seq2seq model [21], each output is affected by the different weights assigned to the different inputs. The concept of hard attention [22] is designed as a stochastic process that uses Monte Carlo sampling to estimate the gradient of the module, thus enabling back-propagation of the gradient. In addition, attention mechanisms include global attention and local attention [23], as well as multi-head attention [24]. Multi-head attention extracts features more comprehensively by mapping node representations into multiple representations through linear mappings and combining the computational results. Inspired by the above work, we consider using a multi-layer attention mechanism to fuse multi-view information and reveal the deep relationships between nodes.

Methodology
The necessary preliminaries are illustrated first, followed by the overall architecture of the proposed model, and then the details of each component are elaborated.

Preliminaries.
This section introduces the concepts and symbols used in this paper. A regular graph $G$ with vertex set $V$, edge set $E$, and weights $W$ can be denoted as $G = (V, E, W)$. For a vertex in the graph, the degree is defined as the sum of the weights of all edges connected to it; for an edge in the graph, the degree is defined as the total number of vertices connected by the edge:
$$d(v) = \sum_{e \in E,\ v \in e} w(e), \qquad \delta(e) = |e|.$$
In the process of modeling real-life information, usually only a single view is used to represent the relationships between nodes. A single view contains only one relationship, but because real-life relationships are complex, it is difficult to capture comprehensive node relationships with only one view. The resulting omission of information leads to deviations in the subsequent processing of the model. A multi-view contains various relationships between nodes. It can capture structural information more accurately than a single view and better discover implicit relationships between nodes.
Thus, a multi-view graph can be denoted as $G = (V, E^{(1)}, \ldots, E^{(M)}, X)$, where $e^{(m)}_{i,j} \in E^{(m)}$ indicates that, in the $m$-th view, node $i$ is connected to node $j$, and $x_i \in X$ denotes the feature of node $v_i$. The node structure in graph $G$ can be represented by multiple adjacency matrices $A^{(m)} = [a^{(m)}_{i,j}]$, $m = 1, \ldots, M$. In our work, the connection between a node and itself is not considered, i.e., $a^{(m)}_{i,i} = 0$. The purpose of the work is to predict traffic flow with the proposed model. The input of the model is the historical traffic data $X \in \mathbb{R}^{N \times C \times T}$, where $N$ indicates the total number of vertices, $C$ is the number of channels of the feature, and $T$ is the time dimension. At the output end of the proposed model, node-level prediction results are obtained, which can be denoted as $\hat{X}$.
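As a minimal sketch of the multi-view representation described above, the following Python snippet builds one symmetric adjacency matrix per view with a zero diagonal. The station indices, the view names `track` and `od_hourly`, and the helper `build_view_adjacency` are hypothetical and serve only as an illustration; they are not taken from the original implementation.

```python
def build_view_adjacency(num_nodes, edges):
    """Return a symmetric 0/1 adjacency matrix with a zero diagonal."""
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        if i == j:
            continue  # connections between a node and itself are not considered
        adj[i][j] = 1
        adj[j][i] = 1  # each view is treated as an undirected graph
    return adj


# Hypothetical toy network with 4 stations and two views.
views = {
    "track": [(0, 1), (1, 2), (2, 3)],
    "od_hourly": [(0, 3), (1, 1), (1, 3)],
}
adjacency = {name: build_view_adjacency(4, edges) for name, edges in views.items()}
```

Each entry of `adjacency` plays the role of one matrix $A^{(m)}$; the self-loop `(1, 1)` in the second view is dropped so that $a^{(m)}_{i,i} = 0$ holds.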

MV-GAT: The Proposed Model.
For complex relationships between entities in the real world, it is difficult to fully grasp the node structure information if only a single view is used to represent the node relationships. In rail transit, considering only the line connections between stations ignores the relationships between stations at the feature level, such as the OD characteristics of passenger trips between stations over different time spans. During the morning and evening peak hours, large passenger flows show relatively fixed patterns, which can also be used as a view for traffic flow prediction. At the same time, it is important to avoid the problem of premature model fitting as the number of layers of the network model increases. When the model uses multiple views as input, how to fuse these views becomes a new problem. The fusion process must ensure that the model can ignore noisy information and that the most relevant node information is extracted from the multiple views.
To address the above issues, we propose the overall framework of the model, as shown in Figure 1. The basic idea is to use the multi-layer attention module to capture the node information contained in the multi-view so that the best node representation can be learned, and to use the autoencoder module to ensure that the model learns the structural information of the data, which is represented as a multi-view graph.
In the multi-view module, multiple views are used to ensure complete information extraction. Specifically, in this forecasting task, the multiple views include a static view based on the connectivity of tracks and routes, and OD views of passenger flows for three different time spans: hourly, daily, and weekly. The autoencoder module learns an accurate data representation and mitigates the over-smoothing problem. The two parts of the input are connected to the autoencoder module and the GCN module, and each layer in the autoencoder module is connected to the corresponding GCN layer, so that the structural information between nodes learned in the autoencoder can be integrated into the GCN module.
In the multi-layer attention module, multi-layer attention is used to fuse the multi-view information to obtain an optimal representation of the data. The multi-layer attention module ensures that the model learns different weights for different nodes and different views.

Multi-View Graph Convolution.
For the single-view graph, the input is $G_k = (A_k, X)$. The multi-view graphs generated by the relationships between the nodes are $G_m = (A^{(1)}, A^{(2)}, \ldots, A^{(m)}, X)$, where $m$ is the number of views. Each input is fed into an exclusive convolution module. The outputs of the convolution are $Z_k$ and $Z_m$. Taking $Z_k$ as an example, the output of the $l$-th layer of the graph convolution can be expressed as
$$Z_k^{(l)} = \mathrm{ReLU}\big(\tilde{D}_k^{-1/2} \tilde{A}_k \tilde{D}_k^{-1/2} Z_k^{(l-1)} W_k^{(l)}\big),$$
where $W_k^{(l)}$ is the weight matrix of the GCN at the $l$-th layer, $\tilde{A}_k = A_k + I$, and $\tilde{D}_k$ is the diagonal degree matrix of $\tilde{A}_k$. For view $m$, the initial embedding is $Z_m^{(0)} = X^{(m)}_{att}$, where $X^{(m)}_{att}$ is the node embedding learned by the single-view attention network in view $m$.
It is difficult to learn the commonality between different views only by learning each view individually, so a multi-view convolution is added to extract the common information between different views.
The proposed model uses the previously constructed input graphs $G_k$ and $G_m$ as inputs to the multi-view convolution. The output of the multi-view convolution module is $Z_c$, and the output of the $l$-th layer of the convolution can be expressed as
$$Z_c^{(l)} = \mathrm{ReLU}\big(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} Z_c^{(l-1)} W_c^{(l)}\big),$$
where $W_c^{(l)}$ is the weight matrix of the $l$-th layer of the GCN, the initial $Z$ is $Z^{(0)} = X$, $\tilde{A} = A + I$, and $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$.
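The layer-wise propagation described above can be illustrated with a small dependency-free Python sketch. It assumes the standard symmetric normalization $\tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ followed by ReLU; the toy adjacency matrix, features, and weights are hypothetical, and plain lists are used only to keep the example self-contained.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Compute D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]  # degrees of A + I, always >= 1
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj, z, w):
    """One propagation step: ReLU(A_hat Z W)."""
    out = matmul(matmul(normalized_adjacency(adj), z), w)
    return [[max(0.0, v) for v in row] for row in out]


adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # three stations on one line
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # node features, Z^(0) = X
w = [[1.0], [-1.0]]                        # hypothetical layer weights
z1 = gcn_layer(adj, x, w)
```

Stacking several calls to `gcn_layer`, each with its own weight matrix, gives the multi-layer convolution for one view.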

Autoencoder Module.
The proposed method introduces an autoencoder to learn the structural information of the data and pass the learned information to the corresponding GCN layers. The added autoencoder module also helps to alleviate the over-smoothing problem of the GCN.
Assuming that the autoencoder has $L$ layers, the representation learned in the $l$-th layer of the encoder is $H^{(l)}$:
$$H^{(l)} = \mathrm{ReLU}\big(W_e^{(l)} H^{(l-1)} + b_e^{(l)}\big).$$
In the formula, ReLU is the activation function of the fully connected layer, and $W_e^{(l)}$ and $b_e^{(l)}$ are the weight matrix and bias of the $l$-th layer in the encoder. In addition, $H^{(0)}$ is the feature matrix $X$.
Then, in the decoding part, the input data are reconstructed through fully connected layers:
$$H^{(l)} = \mathrm{ReLU}\big(W_d^{(l)} H^{(l-1)} + b_d^{(l)}\big).$$
Here, $W_d^{(l)}$ and $b_d^{(l)}$ are the weight matrix and bias of the $l$-th layer of the decoder. In order to pass the node representation into the GCN module, the node representations $H^{(1)}, H^{(2)}, \ldots, H^{(L)}$ are learned from the autoencoder. After they are passed into the GCN module, the GCN can hold two different kinds of information: the data itself and the data structure. For example, the output of the $l$-th layer learned in the single view can be expressed as $Z_k^{(l)}$. The representation $H^{(l)}$ learned by the autoencoder can reconstruct the data itself and contains different valuable information. Combining the two representations leads to a more complete representation:
$$\tilde{Z}_k^{(l)} = (1 - \epsilon) Z_k^{(l)} + \epsilon H^{(l)}.$$
Here, $\epsilon$ is the balance coefficient with an initial setting of 0.5. In this way, the autoencoder and the GCN are connected layer by layer. We use ReLU as the activation function to mitigate the gradient vanishing problem.
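The layer-by-layer connection between the autoencoder and the GCN can be sketched as follows. This is an illustrative assumption rather than the original implementation: `dense_relu` applies a fully connected layer row-wise (one row per node, so it computes ReLU(h w + b)), `fuse` blends a GCN output with the autoencoder representation using the balance coefficient, and all numeric values are hypothetical.

```python
def dense_relu(h, w, b):
    """Fully connected layer with ReLU, applied to each node row: ReLU(h w + b)."""
    return [
        [max(0.0, sum(x * wv for x, wv in zip(row, col)) + bv)
         for col, bv in zip(zip(*w), b)]
        for row in h
    ]


def fuse(z, h, eps=0.5):
    """Blend the GCN output z with the autoencoder output h: (1 - eps) z + eps h."""
    return [[(1.0 - eps) * zv + eps * hv for zv, hv in zip(zr, hr)]
            for zr, hr in zip(z, h)]


x = [[1.0, 2.0], [3.0, -1.0]]     # feature matrix, H^(0) = X
w_e = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical encoder weights
b_e = [0.0, 0.5]                  # hypothetical encoder bias
h1 = dense_relu(x, w_e, b_e)      # first encoder layer H^(1)
z1 = [[0.0, 0.5], [1.0, 2.0]]     # stand-in for a GCN layer output of the same shape
fused = fuse(z1, h1, eps=0.5)     # representation passed on to the next GCN layer
```

With `eps=0.5`, both sources contribute equally; moving `eps` toward 0 or 1 shifts the balance toward the GCN or the autoencoder, respectively.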

Multi-Layer Attention.
Since the model takes multiple views as input, the proposed method designs a multi-layer attention module to effectively integrate the node representations learned in different views into an optimal combination. First, the proposed method uses a single-view attention layer to learn the influence of different neighbor nodes on the predicted node in the same view. Then, a multi-view attention layer is used to learn the influence of different views on the predicted node. Finally, the two parts are combined to obtain the optimal weighted combination for the nodes to be predicted.
In the single-view attention layer, the influence of different neighbor nodes on the predicted node in each view is learned. Since each node plays a different role in the process of node embedding, its impact on the final node prediction result also differs. Self-attention is therefore used to learn the weights between nodes. For instance, in view $m$, the attention index of a pair of nodes $(i, j)$ can be formulated as
$$e^{(m)}_{ij} = \mathrm{att}(x_i, x_j; m).$$
Here, att represents the attention mechanism. Since the multiple views are undirected graphs, the importance of node $i$ to node $j$ is the same as that of node $j$ to node $i$; therefore, $e^{(m)}_{ij}$ forms a symmetric matrix. After calculating $e^{(m)}_{ij}$ for the neighbors $j$ of node $i$, the weight coefficient is normalized as
$$\alpha^{(m)}_{ij} = \mathrm{softmax}_j\big(e^{(m)}_{ij}\big) = \frac{\exp\big(\mathrm{ReLU}(a_m^{T} [x_i \,\|\, x_j])\big)}{\sum_{k \in N_i^{(m)}} \exp\big(\mathrm{ReLU}(a_m^{T} [x_i \,\|\, x_k])\big)}.$$
In the equation, $\|$ represents the concatenation operation, $a_m^{T}$ is the attention vector in the single view, and $N_i^{(m)}$ is the set of neighbors of node $i$ in view $m$. The embedding of node $i$ in the view is obtained by aggregating the features of its neighbor nodes weighted by these coefficients:
$$z^{(m)}_i = \mathrm{ReLU}\Big(\sum_{j \in N_i^{(m)}} \alpha^{(m)}_{ij} x_j\Big).$$
Multi-head attention is used to make the training process more stable. Softmax and ReLU are both activation functions. Specifically, the single-view attention layer is repeated $K$ times, and the learned embeddings are concatenated into the view-specific embedding. The learned node embedding and the feature matrix are then spliced to obtain $X^{(m)}_{att}$. In the following equation, $z^{(m)}_i$ is the embedding of node $i$ learned in view $m$:
$$z^{(m)}_i = \big\Vert_{k=1}^{K}\, \mathrm{ReLU}\Big(\sum_{j \in N_i^{(m)}} \alpha^{(m),k}_{ij} x_j\Big).$$
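To make the single-view attention step more concrete, here is a small single-head sketch in plain Python. It assumes that the score of a node pair is ReLU applied to the dot product of an attention vector with the concatenated features, followed by a softmax over the neighbors; the function names, features, neighbor lists, and attention vector `a` are hypothetical and chosen only for illustration.

```python
import math


def attention_coefficients(features, neighbors, a, i):
    """Softmax-normalized attention weights of node i over its neighbors."""
    scores = []
    for j in neighbors[i]:
        concat = features[i] + features[j]  # list concatenation [x_i || x_j]
        scores.append(max(0.0, sum(av * cv for av, cv in zip(a, concat))))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(features, neighbors, a, i):
    """Embedding of node i: ReLU of the attention-weighted sum of neighbor features."""
    alphas = attention_coefficients(features, neighbors, a, i)
    dim = len(features[i])
    return [max(0.0, sum(al * features[j][d] for al, j in zip(alphas, neighbors[i])))
            for d in range(dim)]


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1, 2], 1: [0], 2: [0]}  # neighbor lists within one view
a = [0.5, 0.5, 1.0, -1.0]                # hypothetical attention vector
alphas0 = attention_coefficients(features, neighbors, a, 0)
z0 = aggregate(features, neighbors, a, 0)
```

A multi-head variant would call `aggregate` with $K$ different attention vectors and concatenate the resulting lists.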
A single view contains only one type of relationship between nodes, while a multi-view contains multiple types of relationships between nodes. To learn more comprehensive node embeddings, it is necessary to integrate the node embeddings learned from different views. For different nodes or associations, the weights assigned to different views differ, so a multi-view attention layer is designed to automatically assign different weights to different views. The inputs of the multi-view attention layer are the single-view graph convolution outputs $Z_k$ and $Z^{(m)}$ and the multi-view convolution output $Z_c$, and the attention mechanism learns the weights corresponding to the different views:
$$(\alpha_k, \alpha^{(m)}, \alpha_c) = \mathrm{att}(Z_k, Z^{(m)}, Z_c).$$
Here, $\alpha_k$, $\alpha^{(m)}$, and $\alpha_c$ are the attention weights of the different views, respectively. For node $i$, whose embedding in $Z^{(m)}$ is $z^{i}_{m}$, a nonlinear transformation is applied to the node embedding, and then the shared attention vector $q$ is used to calculate the attention value $\omega^{i}_{m}$:
$$\omega^{i}_{m} = q^{T} \tanh\big(W (z^{i}_{m})^{T} + b\big).$$
$W$ is the weight matrix and $b$ is the bias. The attention values of node $i$ in the other embedding matrices are obtained in the same way. Then, the final weight is calculated by normalizing the multiple attention values:
$$\alpha^{i}_{m} = \mathrm{softmax}(\omega^{i}_{m}) = \frac{\exp(\omega^{i}_{m})}{\exp(\omega^{i}_{k}) + \exp(\omega^{i}_{m}) + \exp(\omega^{i}_{c})}.$$
The larger $\alpha^{i}_{m}$ is, the more important the view is. The multiple embeddings are then linearly combined:
$$Z = \alpha_k \cdot Z_k + \alpha^{(m)} \cdot Z^{(m)} + \alpha_c \cdot Z_c.$$
The above multi-view attention module solves the problem of assigning different weights to the views, thereby enabling adaptive inter-view importance learning.
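The multi-view attention fusion for a single node can be sketched as below. The sketch assumes tanh as the nonlinear transformation, which is a common choice but is not confirmed by the text above; the weight matrix `w`, bias `b`, shared attention vector `q`, and the three toy embeddings are hypothetical.

```python
import math


def view_score(z_i, w, b, q):
    """Attention value of one view for one node: q^T tanh(W z_i + b)."""
    hidden = [math.tanh(sum(wv * zv for wv, zv in zip(w_row, z_i)) + bv)
              for w_row, bv in zip(w, b)]
    return sum(qv * hv for qv, hv in zip(q, hidden))


def fuse_views(embeddings, w, b, q):
    """Softmax the per-view scores and return (weights, weighted sum of embeddings)."""
    scores = [view_score(z, w, b, q) for z in embeddings]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(embeddings[0])
    fused = [sum(al * z[d] for al, z in zip(alphas, embeddings)) for d in range(dim)]
    return alphas, fused


# Embeddings of one node from Z_k, Z^(m), and Z_c (toy values).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical weight matrix W
b = [0.0, 0.0]                # hypothetical bias
q = [1.0, 1.0]                # hypothetical shared attention vector
alphas, fused = fuse_views(embeddings, w, b, q)
```

Because the weights are computed per node, two stations can weight the same view differently, which is the adaptive behavior the module is meant to provide.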

Objective Function.
In order to allow the convolution to capture richer information, we increase the difference between $Z_k$, $Z_m$, and $Z_c$. Here, we use the Hilbert-Schmidt independence criterion (HSIC) to measure the independence between the outputs:
$$\mathrm{HSIC}(Z_k, Z_c) = (n - 1)^{-2}\, \mathrm{tr}(R K_k R K_c).$$
Here, $K_k$ and $K_c$ are the Gram matrices of $Z_k$ and $Z_c$, and $R = I - \frac{1}{n} e e^{T}$, where $I$ is the identity matrix and $e$ is the all-ones column vector. In the same way, HSIC is calculated for all other views, and the result is denoted as $L_s$. The multi-view loss function is supposed to learn as much consistency between different views as possible. After normalizing the matrices $\{Z_{c_i}\}_{i=1}^{4}$ to $\{Z_{c_i nor}\}_{i=1}^{4}$ with L2 normalization, the similarity between nodes $S_i$ is calculated for each of them, and the sum is denoted as $L_m$.
In the autoencoder module, the output of the decoder is the reconstructed original data. The node-level traffic flow prediction results are output through a fully connected layer that maps the multiple channels to a single channel. The final loss function is $L$, where $a$ and $b$ are weighting parameters.
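As a rough illustration of the HSIC term, the following dependency-free sketch evaluates $(n-1)^{-2}\,\mathrm{tr}(R K_1 R K_2)$ with the centering matrix $R = I - \frac{1}{n} e e^{T}$. It assumes a linear kernel $K = Z Z^{T}$, which is one possible choice and not necessarily the kernel used in the original experiments; the toy embeddings are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def hsic(z1, z2):
    """Empirical HSIC with linear kernels: tr(R K1 R K2) / (n - 1)^2."""
    n = len(z1)
    k1 = matmul(z1, transpose(z1))  # Gram matrix of the first embedding
    k2 = matmul(z2, transpose(z2))  # Gram matrix of the second embedding
    r = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]
    m = matmul(matmul(r, k1), matmul(r, k2))
    return sum(m[i][i] for i in range(n)) / (n - 1) ** 2


z_a = [[1.0], [2.0], [3.0]]  # toy embedding that varies across nodes
z_b = [[1.0], [1.0], [1.0]]  # toy embedding that is constant across nodes
dependent = hsic(z_a, z_a)
independent = hsic(z_a, z_b)
```

A value near zero indicates that the two embeddings carry little shared (linear) information, while larger values indicate stronger dependence.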

Experiments and Results
The proposed MV-GAT model is evaluated by comparing it with state-of-the-art baselines. The experimental dataset and baselines are introduced first, followed by the parameter setup. Finally, the experimental results and analysis are presented.

Experiment Setting.
We adopt the historical data of the Beijing metro as the experimental dataset. MetroBJ [25] is a five-month passenger flow dataset, formally collected in 2015, with a granularity of 5 minutes. The dataset covers the entire subway network with 325 stations and 22 lines, covering the daily traffic data in July, August, September, November, and December. The time horizon is five months, covering weekdays and weekends. The time series contained in this dataset is long enough for us to divide it into multiple time spans and build multiple feature-level views. The dataset contains the desensitized swipe ID, the line, station, and time of entering the subway, and the line, station, and time of leaving the subway. In the actual use of this method, a node set containing 325 nodes is first constructed from the Beijing subway stations in this dataset, and a basic view containing 22 edges is constructed with reference to the subway network lines. On this basis, the DBSCAN algorithm is used to cluster the historical passenger flow data under three different time spans of hours, days, and weeks and to construct the corresponding multi-views. Compared with the traditional k-means algorithm, the DBSCAN algorithm does not need the number of clusters k as input, can find clusters of any shape, and can identify outliers during clustering. Finally, the traffic flow values of each node in the next 5 minutes, 10 minutes, and 15 minutes are output to calculate the accuracy of the proposed model. The comparison methods fall into two categories: nongraph methods and graph-based methods. The nongraph methods are the autoregressive integrated moving average (ARIMA) model [26], support vector regression (SVR) [27], and long short-term memory (LSTM) [28].
The graph-based deep learning methods are the temporal graph convolutional network (T-GCN) [29], the spatio-temporal graph convolutional network (STGCN) [30], and the diffusion convolutional recurrent neural network (DCRNN) [31]. The detailed parameter settings are listed as follows.
(1) ARIMA: ARIMA is a common time series forecasting method. The degree of differencing d, the lag order p, and the order of the moving average q are determined with "auto_arima" in the "pyramid" library. The diffusion convolutional recurrent neural network is a data-driven traffic prediction model with an autoencoder framework. It has two RNN layers of 64 units. The batch size is set to 64, and the learning rate is set to $10^{-3}$.
To quantitatively evaluate the prediction accuracy of the proposed method, the experiments take the mean absolute error (MAE) and the root mean square error (RMSE) as performance metrics:
$$\mathrm{MAE} = \frac{1}{TN} \sum_{i=1}^{T} \sum_{j=1}^{N} \big| X_{ij} - \hat{X}_{ij} \big|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{TN} \sum_{i=1}^{T} \sum_{j=1}^{N} \big( X_{ij} - \hat{X}_{ij} \big)^{2}},$$
where $X_{ij}$ is the ground truth, $\hat{X}_{ij}$ is the predicted value, $T$ is the time length, and $N$ is the number of nodes. For both MAE and RMSE, the lower the value, the higher the accuracy. All experiments are run on a platform with an "Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90 GHz" CPU and an "NVIDIA GTX 2080Ti" GPU. The number of training epochs is 50, and the batch size is 64. The learning rate is set to $10^{-2}$ and gradually decreases to $10^{-4}$.
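The two metrics can be computed with a few lines of Python. The sketch below treats the ground truth and the prediction as nested lists indexed by time step and node; the numbers are made up for illustration and are not taken from the experiments.

```python
import math


def mae(truth, pred):
    """Mean absolute error over all time steps and nodes."""
    errors = [abs(t - p) for t_row, p_row in zip(truth, pred)
              for t, p in zip(t_row, p_row)]
    return sum(errors) / len(errors)


def rmse(truth, pred):
    """Root mean square error over all time steps and nodes."""
    squared = [(t - p) ** 2 for t_row, p_row in zip(truth, pred)
               for t, p in zip(t_row, p_row)]
    return math.sqrt(sum(squared) / len(squared))


truth = [[10.0, 20.0], [30.0, 40.0]]  # T = 2 time steps, N = 2 nodes
pred = [[12.0, 18.0], [33.0, 40.0]]
mae_value = mae(truth, pred)
rmse_value = rmse(truth, pred)
```

Because RMSE squares each error before averaging, it penalizes large individual deviations more heavily than MAE does, which is why the two are usually reported together.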

Experiment Results.
To fully utilize the different views over multiple time spans, we use the data of a whole month as the experimental data. The experiments use ten-fold cross-validation to obtain more stable results. Considering the size of the dataset per month, the training set, testing set, and validation set are split at a ratio of 8 : 1 : 1 along the time dimension. The experimental results are shown in Table 1. Table 1 shows the prediction accuracy when the historical data of July and September are used as the experimental dataset. As can be seen from the results, the accuracy of the ARIMA method is significantly lower than that of the machine learning and deep learning methods. SVR significantly outperforms ARIMA, and LSTM is in turn better than SVR by virtue of modeling long- and short-term sequences. T-GCN, an earlier method that combines graph networks with time series dependency, achieves accuracy similar to that of the relatively mature LSTM.
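The 8 : 1 : 1 split along the time dimension mentioned above can be sketched as a simple chronological slicing. The ordering of the three segments (training first, then validation, then testing) and the helper name `chronological_split` are assumptions made for this illustration.

```python
def chronological_split(series, ratios=(8, 1, 1)):
    """Split a time-ordered sequence into three consecutive segments."""
    total = sum(ratios)
    n = len(series)
    train_end = n * ratios[0] // total
    val_end = train_end + n * ratios[1] // total
    return series[:train_end], series[train_end:val_end], series[val_end:]


time_steps = list(range(100))  # stand-in for 100 consecutive 5-minute intervals
train, val, test = chronological_split(time_steps)
```

Keeping the segments contiguous avoids leaking future observations into the training data, which a random shuffle of time steps would do.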
As a classical framework, STGCN has achieved more accurate prediction results, especially in the medium-term prediction of the next 45 minutes, where obvious advantages can be seen. With a unique architecture, DCRNN also achieves accurate results. Among all methods, our proposed method achieves better accuracy, especially on short-term predictions of 15 minutes and 30 minutes. Compared with machine learning methods and graph-based deep learning methods, there are significant improvements. More experimental results of flow prediction are shown in Table 2.
It can be seen from Table 2 that the prediction accuracy of each method is similar to that presented in Table 1. This shows that rail transit follows a basically stable operating pattern in each month. It is worth mentioning that, similar to the previous set of experiments, STGCN achieves a clear advantage in the 45-minute prediction results. This indirectly reflects the complexity of traffic forecasting. In many cases, it is difficult to solve short-term, medium-term, and even long-term forecasting problems simultaneously with one model.
To highlight the role of each module of the proposed model, we design a set of ablation experiments, as shown in Table 3. In this set of ablation experiments, we mainly compare the difference between single view and multi-view, and the role of the autoencoder. The experimental data are the passenger flow data of Beijing rail transit in July. We first test the single-view network model without the autoencoder module. The single view is the graph of the rail transit network. While the autoencoder is removed, the other parts of the proposed model remain unchanged. It can be seen that the prediction accuracy of this variant is unsatisfactory, and it cannot even beat the STGCN model on this dataset. Whether the multi-layer attention mechanism degrades performance in the single-view case is a new question worth investigating.
By adding the autoencoder module to the single-view model, the prediction accuracy is improved, but the improvement is relatively limited. The autoencoder module can alleviate the gradient vanishing problem during training to a certain extent, especially for deep graph convolutional network models with many layers. Limited by the graph scale of the dataset used in this experiment, the network model does not have many layers. Therefore, whether the autoencoder module can play a larger role in deeper graph convolution prediction models remains to be explored.
After the introduction of multi-view, the prediction accuracy of the model is significantly improved compared to single view, with or without an autoencoder module. Among them, the model achieves the best prediction results when the multi-view module and the autoencoder module coexist.

Conclusions
This paper proposes a multi-view and multi-layer attention-based GCN model for the problem of rail traffic flow prediction. Considering that it is difficult to fully express the relationship between nodes in the node classification problem using only a single view, this model introduces multi-view and utilizes a multi-layer attention mechanism and an autoencoder module to achieve more accurate temporal prediction. Experimental results on the Beijing dataset show that our model outperforms other nongraph and graph-based benchmark methods. In the future, we will optimize the framework of the proposed method and try to design models for directed graphs. We also want to explore more comprehensively the application of graph-based deep learning in intelligent transportation systems.

Data Availability
The data supporting this proposed model are from previously reported studies and datasets, which have been cited. The processed data are available from the corresponding author upon request.