Spatial-temporal Attention Fusion for Traffic Speed Prediction

Abstract: Accurate vehicle speed prediction is of great significance to urban intelligent traffic control systems. However, the modules that integrate temporal and spatial features in existing traffic speed prediction methods are effective for short-term prediction, while their medium-term and long-term prediction errors are relatively large. To address this limitation, this paper proposes a traffic speed prediction method that combines attention with spatial-temporal features, referred to as ASTCN. Specifically, unlike previous methods, ASTCN uses a temporal attention convolutional network (ATCN) to extract temporal features separately from the traffic speed series collected by each sensor, uses a spatial attention mechanism to extract spatial features, and then fuses the spatial and temporal features. Experiments on three real-world datasets show that the proposed ASTCN model outperforms the state-of-the-art baselines.

Transportation plays a vital role in everyday life. According to a 2015 survey, American drivers spend about 48 minutes per day driving on average [1]. Intelligent control of urban traffic is therefore very important, and traffic speed prediction has received growing attention as one of its components. Traffic speed prediction uses the known road network structure and the traffic speed data of historical time steps to predict the traffic speed at future time steps. By prediction horizon, it can be divided into three types: short-term prediction (within 30 minutes), medium-term prediction (30 to 60 minutes), and long-term prediction (over 60 minutes). Over the past four decades, demand for urban intelligent traffic control technology has grown, because such systems can not only provide drivers with accurate information but also support signal optimization and coordinated vehicle control. Traffic speed prediction has therefore long been an active research topic [2]. If speeds can be predicted accurately in advance, the traffic management department can guide vehicles more reasonably and improve the operating efficiency of the road network. However, because of complex temporal and spatial features, accurate traffic speed prediction remains a challenging problem.
Traffic speed prediction is a classic spatial-temporal data prediction problem. Traffic data are recorded at fixed points in time and at fixed locations that are distributed continuously in space. Observations made at adjacent locations and adjacent time points are therefore dynamically related to each other, as shown in Fig 1. As time passes, the traffic speed at each intersection is affected by the traffic conditions at the same intersection in the previous time step (the thick brown arrows in the vertical direction) and by the traffic conditions at adjacent intersections (the thin red arrows in the transverse direction). In short, the correlation of road network traffic data is strongly dynamic in both the spatial and temporal dimensions. The key to dynamic prediction is therefore how to extract the temporal and spatial features effectively and how to integrate them to predict the traffic speed. Mining nonlinear, complex spatial-temporal data, discovering its inherent spatial-temporal patterns, and making accurate traffic speed predictions remain very challenging.
As transportation infrastructure has developed, many information collection devices have been deployed in road networks, so the information they collect can be used directly to predict traffic speed. Many researchers have worked on this problem. Early work used time series analysis models for traffic prediction, but in practical applications these models have difficulty handling unstable, nonlinear data. Later, traditional machine learning methods were developed to model more complex data, but they still struggle to account for the spatial and temporal correlations of high-dimensional traffic data at the same time. Deep learning (DL) is an effective tool for big data analysis: it can automatically identify patterns and features in complex data through unsupervised or supervised learning. In recent years, many researchers have used deep learning methods to process high-dimensional spatial-temporal data. Convolutional neural networks (CNN) are used to extract spatial features from grid data, and graph convolutional networks (GCN) are used to describe the spatial correlation of graph-based data. ChebNet [18] is a powerful GCN that uses a Chebyshev polynomial expansion to reduce the complexity of Laplacian computation. GraphSAGE [19] samples a fixed number of neighbors for each node in the graph and aggregates the features of the node and its neighborhood. GAT [20] is a powerful GCN variant defined in the vertex domain, which uses attention layers to dynamically adjust the importance of neighbor nodes.
To make full use of spatial features, some researchers use a convolutional neural network (CNN) to capture the adjacency relationships within traffic networks and a recurrent neural network (RNN) along the time axis. By combining a long short-term memory (LSTM) network [3] with a one-dimensional CNN, Wu and Tan [4] proposed CLTFP, a feature-level fusion structure for short-term traffic prediction. Later, Shi et al. [5] proposed the convolutional LSTM, an extension of the fully connected LSTM (FC-LSTM) with embedded convolution layers. Zhang et al. [6] designed the ST-ResNet model, based on residual convolution units, to predict crowd flow. Yao et al. [7] proposed a traffic volume prediction method that combines CNN with LSTM to model spatial and temporal correlations jointly. Yu et al. [8] proposed the spatial-temporal graph convolutional network (STGCN), a deep learning framework for time series prediction in the transportation field. Li et al. [15] proposed the diffusion convolutional recurrent neural network (DCRNN), which introduced graph convolution into spatial-temporal network data prediction and used a diffusion graph convolution to describe the information diffusion process in the spatial network. Guo et al. [9] proposed a deep learning traffic prediction framework based on the graph attention network (GAT) and the temporal convolutional network (TCN), called the graph attention temporal convolutional network (GATCN). Zhao et al. [10] proposed the temporal graph convolutional network (T-GCN), a neural-network-based traffic prediction method. Song et al. [13] proposed the Spatial-Temporal Synchronous Graph Convolutional Network (STSGCN). Guo et al. [14] proposed a deep spatial-temporal 3D convolutional neural network (ST-3DNet), which introduced three-dimensional convolution into this field. Wu et al. [16] designed an adaptive matrix to account for changes in the influence between nodes and their neighbors. Bai et al. [17] attempted to model spatial and temporal correlations simultaneously by using a gated residual GCN module with two attention mechanisms. Kong et al. [21] proposed the Spatial-Temporal Graph Attention Network (STGAT), an end-to-end, deep-learning-based dual-path framework. However, the above methods are effective for short-term forecasts, while their errors are large for medium-term and long-term forecasts. Zheng et al. [11] proposed an attention-based encoder-decoder framework that computes spatial attention scores from all vertices, but it consumes a lot of time and memory.
To address the above problems, we propose the ASTCN method. The main contributions are as follows:
1. In traffic speed prediction, the lengths of the historical and future time steps are treated as significant factors, and temporal attention convolutional networks are used to extract features from the traffic speed observed by each observation device.
2. A revised attention mechanism is used to extract spatial features.
3. A spatial-temporal feature fusion (FST) module is used to fuse the spatial and temporal features.
The rest of this paper is organized as follows: Section 2 describes the traffic speed prediction problem and gives some definitions. Section 3 introduces the architecture of ASTCN for traffic speed prediction. Section 4 presents the experiments, and Section 5 gives the conclusion and future work.

2. Problem Setup
This section introduces the transportation network structure, the description of the traffic speed prediction problem and the structure of the input and output data.

Transportation Network Structure
In this paper, we use an undirected graph $G = (V, E, A)$ to define the transportation network, where $V$ is a finite set of $|V| = N$ vertices, with $N$ the number of observation devices in the transportation network; $E$ is the set of edges, indicating the connectivity between observation points; and $A$ is the weighted adjacency matrix of $G$. If observation device $i$ and observation device $j$ are directly connected, the value of $A_{ij}$ is the cost of that connection. In the traffic speed matrix $X$, $x_t^n$ is the speed observed by observation device $n$ at time $t$.

Traffic Speed Forecast
Traffic speed prediction is a typical time series prediction problem. It uses the current and historical state of the road network, together with objective conditions (such as the road network structure, weather conditions, and emergencies), to predict the traffic speed in the future.
The traffic speed prediction problem can therefore be regarded as learning a mapping function, given the road network structure $G$ and the traffic speed matrix $X$, and then calculating the traffic speed at time $T$, as shown in Formula 1.
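As a sketch of the mapping described above (an assumed form, not necessarily the exact Formula 1), the problem can be written as follows. The symbols $f$ (the learned mapping), $T_h$ (the number of historical time steps), $T$ (the number of future time steps), and $\hat{X}$ (the predicted speeds) are notation chosen here for illustration.

```latex
\left[\hat{X}_{t+1}, \dots, \hat{X}_{t+T}\right]
  = f\left(G;\ \left(X_{t-T_h+1}, \dots, X_t\right)\right)
```

Each $X_s$ is the vector of speeds observed by all $N$ devices at time step $s$, so the mapping takes $T_h$ past observations and the graph and returns $T$ future ones.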

The Structure of Input Data and Output Data
The input data of the ASTCN traffic speed prediction model consists of the weighted adjacency matrix and the traffic speed matrix of the historical time steps. The output of the model is the traffic speed matrix of the predicted time steps.
The error of the model is calculated by comparing its predictions with the real data.
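The input and output structure described above can be illustrated with a small sliding-window sketch. It is only an assumption about how samples might be built: the function name `make_windows`, the toy data, and the row-per-time-step layout are hypothetical, while the window lengths of 24 historical and 24 future steps follow the experimental setting stated later in the paper.

```python
def make_windows(speeds, n_hist=24, n_future=24):
    """Split a (time steps x sensors) speed matrix into (history, target) pairs.

    Hypothetical helper for illustration; `speeds` is a list of rows,
    one row per time step, one column per observation device.
    """
    samples = []
    last_start = len(speeds) - n_hist - n_future
    for start in range(last_start + 1):
        history = speeds[start:start + n_hist]
        target = speeds[start + n_hist:start + n_hist + n_future]
        samples.append((history, target))
    return samples


# Toy data (made up): 60 time steps for 3 observation devices.
toy_speeds = [[float(t + 10 * s) for s in range(3)] for t in range(60)]
toy_samples = make_windows(toy_speeds)
print(len(toy_samples), len(toy_samples[0][0]), len(toy_samples[0][1]))
```

Each sample pairs a block of historical speeds (the model input, alongside the adjacency matrix) with the block of future speeds that the prediction is compared against.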

Methodology
This section introduces the ASTCN network structure and its details, including spatial-temporal convolution block and fully connected output layer.

Model Framework
In this part, we elaborate on the structure of ASTCN. As shown in Fig 3, ASTCN contains two spatial-temporal convolution blocks and a fully connected output layer. Each spatial-temporal convolution block contains a temporal attention convolutional network (ATCN), a spatial attention network, and a spatial-temporal feature fusion module (FST). ATCN adds an attention mechanism for temporal feature extraction on top of the temporal convolutional network (TCN). The FST module then fuses the extracted temporal and spatial features.

Spatial-Temporal Convolutional Block
The spatial-temporal convolutional block captures the dynamic spatial-temporal correlation in the road network. It consists of a spatial attention network, a temporal attention convolutional network, and a spatial-temporal feature fusion module.

Temporal Attention Convolutional Network
Different historical time steps have different effects on the prediction results. In traffic speed prediction, the length of the historical time step is regarded as a significant factor, and the length of the future time step is a significant indicator for measuring the accuracy of the model. In Section 4, we run a comparative experiment with different lengths of historical time steps.
In the temporal dimension, the traffic speed at the current moment is dynamically affected by the traffic speed at historical moments. Here we use the temporal attention convolutional network to extract temporal features from the traffic speed observed by each observation device. In this module, we dynamically extract the temporal correlation, as shown in Formula 2.
In Formula 2, $W_1$ is a learnable weight matrix, $T_h$ is the number of historical time steps input in the experiment, $X_i$ is the set of speeds observed by observation device $i$, $N$ is the number of observation devices in the road network, and $\sigma(\cdot)$ is an activation function. Here the activation function is the ReLU function, $\mathrm{ReLU}(x) = \max(0, x)$, as shown in Formula 3.
$\mathrm{TCN}(\cdot)$ is a temporal convolutional network, specified in Formula 4. The Temporal Convolutional Network (TCN) architecture [12] uses causal convolution, that is, no information leaks from the future to the past during model training. At the same time, the architecture can take a sequence of any length and map it to an output sequence of the same length, similar to an RNN. Given an input sequence $x_0, x_1, \dots, x_T$, we want the TCN to output the corresponding results $y_0, y_1, \dots, y_T$ through a mapping $f$: $\hat{y}_0, \dots, \hat{y}_T = f(x_0, \dots, x_T)$. The value of $\hat{y}_t$ depends only on $x_0, \dots, x_t$ and not on any future input $x_{t+1}, \dots, x_T$. The goal of learning in sequence modeling is to find the mapping $f$ that minimizes the expected loss between the actual outputs and the predictions.
In addition to causal convolution, TCN follows the principle that the input sequence and the output sequence have the same length. TCN uses a one-dimensional fully-convolutional network (FCN) to meet this principle: each hidden layer has the same length as the input layer, and zero padding of length (kernel size - 1) is added so that each layer keeps the length of the previous one. In short, TCN = 1D FCN + causal convolutions. In this paper, the experiment only needs the traffic speed sequence of 24 historical time steps as input to predict the traffic speed of 24 future time steps. Because this input length is relatively short, we do not use the dilated convolution of TCN. The TCN structure used in this article is shown in Fig 4.
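The two TCN properties described above, causality and equal input and output length, can be seen in a minimal pure-Python sketch of a single causal convolution without dilation. The function name `causal_conv1d`, the kernel weights, and the toy speeds are hypothetical; this is an illustration of the padding idea, not the network used in the experiments.

```python
def causal_conv1d(sequence, kernel):
    """Causal 1-D convolution that keeps the output as long as the input.

    The sequence is left-padded with len(kernel) - 1 zeros, so output[t]
    is a weighted sum of sequence[t - k + 1 .. t] and never sees the future.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(sequence)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(sequence))
    ]


# Toy speeds for one observation device (made-up values).
speeds = [60.0, 58.0, 55.0, 57.0]
kernel = [0.25, 0.25, 0.5]  # the last weight applies to the current step

out = causal_conv1d(speeds, kernel)

# Changing only the last input must leave all earlier outputs unchanged.
changed = causal_conv1d([60.0, 58.0, 55.0, 0.0], kernel)
print(out, changed[:3] == out[:3])
```

Because the padding is applied only on the left, the output has one value per input step, and each value depends on the current and earlier inputs alone.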

Spatial Attention Network
In the spatial dimension, the traffic speed at the current location is dynamically affected by the neighboring locations. Here we use a revised attention mechanism to capture the dynamic correlation between different nodes in the spatial dimension, as shown in Formula 5. We multiply the road network weight matrix by two learnable weight matrices $W_2$ and $W_3$ to obtain a tensor with the same dimension as the input tensor of the fully connected output layer.
$S = \sigma((W_2 \times A) \times W_3)$ (5)
Here $A \in \mathbb{R}^{N \times N}$ is the standardized weighted adjacency matrix of the road network, and $W_2$ (a four-dimensional tensor, one of whose dimensions is $T_h$) and $W_3$ (a two-dimensional matrix) are the learnable weights. $N$ is the number of observation devices in the road network, $I$ is the input dimension of the convolution, $B$ is the batch size used in the experiment, $T_h$ is the number of historical time steps input in the experiment, $O$ is the output dimension of the convolution, and $\sigma(\cdot)$ is the activation function.
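To make the shape of Formula 5 concrete, the following sketch evaluates ReLU((W2 x A) x W3) on small two-dimensional matrices. The two-dimensional shapes, the helper names (`matmul`, `relu`, `spatial_features`), and all numeric values are assumptions made for illustration; in the model the first weight is a higher-dimensional tensor, and the weights are learned rather than fixed.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Apply max(0, v) to every entry of a matrix."""
    return [[max(0.0, v) for v in row] for row in m]


def spatial_features(w2, adjacency, w3):
    """S = ReLU((W2 x A) x W3), the form of Formula 5 on 2-D matrices."""
    return relu(matmul(matmul(w2, adjacency), w3))


# Hypothetical sizes: 2 history steps, N = 3 devices, output dimension 2.
adjacency = [[0.0, 0.5, 0.0],
             [0.5, 0.0, 0.5],
             [0.0, 0.5, 0.0]]
w2 = [[1.0, 0.0, 0.0],
      [0.0, 2.0, 0.0]]
w3 = [[1.0, 0.0],
      [0.0, 1.0],
      [-2.0, 0.0]]

S = spatial_features(w2, adjacency, w3)
print(S)
```

The adjacency matrix sits between the two weight matrices, so every output entry mixes information from connected devices before the activation is applied.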

Spatial-Temporal Feature Fusion Module (FST)
To make full use of the temporal and spatial features extracted by the ASTCN model, we need to fuse them. The traffic speed of a road at a specific time is related both to its previous traffic speed and to the traffic speed of adjacent roads. Zheng et al. [11] designed a gated fusion to adaptively fuse the spatial and temporal features. In this paper, we modify this method by adding a learnable weight matrix, which makes the tensor dimension of the temporal feature $T$ consistent with that of the spatial feature $S$. The proposed spatial-temporal feature fusion method is shown in Fig 5, and the fused spatial-temporal feature is computed as
$\sigma(T \times W_4) + S$ (6)
Here $W_4$ is the learnable weight matrix. Multiplying the temporal feature matrix $T$ by $W_4$ makes its tensor dimension consistent with the spatial feature $S$, and $\sigma(\cdot)$ is the activation function. The two terms are added together to obtain the spatial-temporal features, which are then passed to the next operation.
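The fusion step can be sketched on small matrices as follows. This is a minimal illustration that follows the placement of the activation in Formula 6 and assumes ReLU as the activation; the function names (`matmul`, `fuse`), the shapes, and the values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse(temporal, w4, spatial):
    """Return ReLU(T x W4) + S, with ReLU assumed as the activation."""
    projected = [[max(0.0, v) for v in row] for row in matmul(temporal, w4)]
    return [
        [p + s for p, s in zip(p_row, s_row)]
        for p_row, s_row in zip(projected, spatial)
    ]


# Hypothetical shapes: T is 2 x 3, W4 is 3 x 2, S is 2 x 2.
temporal = [[1.0, 2.0, 3.0],
            [0.0, -1.0, 1.0]]
w4 = [[1.0, 0.0],
      [0.0, 1.0],
      [0.0, -1.0]]
spatial = [[0.5, 0.5],
           [1.0, 2.0]]

fused = fuse(temporal, w4, spatial)
print(fused)
```

The projection by the weight matrix changes the temporal feature from 2 x 3 to 2 x 2, which is what allows the element-wise addition with the spatial feature.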

ASTCN Training Algorithm
The training process of ASTCN is shown in Algorithm 1. Its final steps are:
  Fusing the spatial-temporal features: $\sigma((T \times W_4) + S)$
  $X_{l+1} = \dots$
End for
$X_{pre} = \mathrm{fullyOutput}(X_3)$
return $X_{pre}$

Experiment
In this section we describe datasets, baseline methods, evaluation metrics, and comparison results.

4.1 Datasets
We evaluated the traffic prediction performance of ASTCN on three real datasets. The three real datasets are PEMS04, PEMS08 [13] and LOS [10].
PEMS04 and PEMS08 are collected by the Caltrans Performance Measurement System (PeMS), which collects data in real time every 30 seconds. The traffic data are then aggregated from the raw data into 5-minute intervals. The system has more than 39,000 detectors deployed on highways in the major metropolitan areas of California, and the geographic information of each observation device is recorded in the dataset. The LOS dataset is collected in real time from Los Angeles highways through loop detectors. Like the PEMS datasets, it records the traffic speed every 5 minutes.
Each of the three datasets consists of an adjacency matrix and a traffic speed matrix. The adjacency matrix represents the distances between observation devices, and each column of the traffic speed matrix holds the traffic speed collected by the corresponding observation device in the adjacency matrix. The details and the traffic speed distributions of the three datasets are shown in Table 1 and Fig 6, respectively. We standardize the adjacency matrix with Formula 7 and use Formula 8 [8] to normalize the traffic speed matrix.
In Formula 7, $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix; a row vector in $\mathbb{R}^{1 \times N}$ represents the sum of each column of matrix $A$, which enters the formula under a square root; and $A'$ is the normalized adjacency matrix.
In Formula 8, $X' = (X - \mathrm{mean}(X)) / \mathrm{std}(X)$, where $X \in \mathbb{R}^{P \times N}$ is the traffic speed matrix; $P$ is the total number of minutes covered by the dataset divided by 5, which corresponds to the number of observation time steps of the observation devices; $N$ is the number of observation devices; $X'$ is the standardized traffic speed matrix; and $\mathrm{mean}(X)$ and $\mathrm{std}(X)$ are the mean and standard deviation of the historical time series, respectively.
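The speed normalization described above, which subtracts the mean and divides by the standard deviation, can be sketched as below. The helper names (`zscore`, `inverse_zscore`) and the toy matrix are hypothetical, and computing one mean and one population standard deviation over all entries is an assumption; the statistics could equally be computed per device or on the training split only.

```python
import math


def zscore(matrix):
    """Standardize all entries with a single mean and population std (assumed)."""
    values = [v for row in matrix for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    normalized = [[(v - mean) / std for v in row] for row in matrix]
    return normalized, mean, std


def inverse_zscore(matrix, mean, std):
    """Map standardized values back to the original speed scale."""
    return [[v * std + mean for v in row] for row in matrix]


# Toy speed matrix (made up): 2 time steps x 2 observation devices.
toy_x = [[50.0, 60.0],
         [70.0, 80.0]]
toy_norm, toy_mean, toy_std = zscore(toy_x)
toy_back = inverse_zscore(toy_norm, toy_mean, toy_std)
print(toy_mean, toy_std)
```

Keeping the mean and standard deviation makes it possible to convert model outputs back to real speeds before the prediction error is computed.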

Baseline Method
In the evaluation stage, the proposed ASTCN model is compared with the following two traffic speed prediction methods.
STGCN: the spatial-temporal graph convolutional network [8] mainly uses a graph convolutional network and two-dimensional convolution to extract spatial and temporal features, respectively.
T-GCN: the temporal graph convolutional network [10] uses a graph convolutional network and a GRU to extract spatial and temporal features, respectively.

Evaluation Metrics
In this paper, we use three metrics to evaluate the prediction performance of different traffic speed prediction models. They are the Mean Absolute Error (MAE), the Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE), which are represented by Formula 9, Formula 10, and Formula 11.
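The three metrics can be computed as in the following sketch, which uses their standard definitions. The function names and the toy values are illustrative assumptions, MAPE is reported in percent, and the sketch assumes that no actual speed is zero.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent; assumes no actual value is 0."""
    return 100.0 * sum(
        abs((a - p) / a) for a, p in zip(actual, predicted)
    ) / len(actual)


# Toy speeds (made up) for three time steps of one observation device.
actual = [50.0, 60.0, 40.0]
predicted = [48.0, 63.0, 40.0]
print(mae(actual, predicted), rmse(actual, predicted), mape(actual, predicted))
```

RMSE penalizes large individual errors more heavily than MAE, while MAPE expresses the error relative to the actual speed.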
The range of MAE, RMSE, and MAPE is [0, +∞). All three metrics are 0 when the predicted values equal the real values, which corresponds to a perfect model. A MAPE above 100% indicates an inferior model.
Table 2 shows the comparison of the three methods (ASTCN, STGCN, and T-GCN) for predictions up to 24 future time steps on the PEMS04 dataset, in terms of MAE, RMSE, and MAPE. The results show that on PEMS04 the prediction error of the ASTCN model is lower than that of the STGCN and T-GCN models. For example, when the prediction time step length is 12, the MAE of the ASTCN model is 1.935, while the MAE values of the two baseline models are 2.787 and 2.583, respectively.
Table 3 shows the same comparison on the PEMS08 dataset. When the prediction time step lengths are 3, 6, 9, 12, 15, 18, 21, and 24, the prediction errors of the ASTCN model are lower than those of the STGCN and T-GCN models.
Table 4 shows the same comparison on the LOS dataset. For example, when the prediction time step length is 15, the RMSE of the ASTCN model is significantly lower than that of the other two models. Overall, these results indicate that the ASTCN model outperforms the STGCN and T-GCN models in traffic speed prediction.

Choosing Historical Timestep
To choose an appropriate length for the historical time step, we designed a comparative experiment that sets the length of the historical time step to 24 (2 hours), 36 (3 hours), and 48 (4 hours) and compares the prediction errors on the PEMS04 dataset. Fig 10 and Table 5 show the comparison of the three historical lengths for predictions up to 24 future time steps on PEMS04, in terms of MAE, RMSE, and MAPE. We found that as the length of the historical time step increases, the medium-term and long-term prediction errors decrease, but the short-term prediction error increases. When the future time step length is 3 or 6, the model with a historical time step length of 24 has a lower prediction error than the models with lengths of 36 and 48. When the future time step length is more than 12, the model with a historical time step length of 24 has a higher prediction error than the models with lengths of 36 and 48. Therefore, for short-term traffic speed prediction the historical time step length should be set to 24, that is, 2 hours, and for medium-term and long-term prediction it should be set to 48, that is, 4 hours.

Model Interpretation
To better understand the ASTCN model, we chose one observation device from the test set of the PEMS04 dataset and visualized its predicted and actual traffic speed. Fig 11 shows the visualization results for prediction horizons of 15, 30, 45, 60, 75, 90, 105, and 120 minutes. As the prediction horizon increases, the prediction quality degrades, which is consistent with expectations. In Fig 11, "out_y" denotes the test set data and "out_pre" denotes the prediction result. In the titles of these plots, for example "PEMS04_24_15_traffic_speed", the first number (24) is the historical time step and the second number (15) is the predicted time step.

5. Conclusion
Transportation plays a vital role in everyday life. However, because of complex temporal and spatial features, accurate traffic speed prediction is a challenging problem. Existing traffic forecasting methods are effective for short-term forecasts, but their errors are large for medium-term and long-term forecasts.
To address this problem, we propose the ASTCN method. ASTCN introduces a temporal attention convolutional network and a spatial attention network to extract temporal and spatial features, respectively, and uses a spatial-temporal feature fusion module to fuse them. Experiments on three real datasets show that ASTCN performs better than the baseline methods in traffic speed prediction, not only for short-term prediction but also for medium-term and long-term prediction.
Since ASTCN is a general spatial-temporal prediction framework, it can also be applied to other spatial-temporal prediction tasks, such as precipitation forecasting. In future work on traffic forecasting, we will consider external factors such as weather conditions, points of interest (POI), accidents, and social activities (for example, holidays).

Data availability
The datasets generated during the current study are available from the corresponding author on reasonable request.