ST-TrafficNet: A Spatial-Temporal Deep Learning Network for Traffic Forecasting

This paper presents a spatial-temporal deep learning network, termed ST-TrafficNet, for traffic flow forecasting. Recent deep learning methods rely heavily on an accurate predetermined graph structure to model the complex spatial dependencies of traffic flow, and they are ineffective at harvesting high-dimensional temporal features of the traffic flow. In this paper, a novel multi-diffusion convolution block, constructed from an attentive diffusion convolution and a bidirectional diffusion convolution, is proposed; it is capable of extracting precise potential spatial dependencies. Moreover, a stacked Long Short-Term Memory (LSTM) block is adopted to capture high-dimensional temporal features. By integrating the two blocks, ST-TrafficNet can learn the spatial-temporal dependencies of intricate traffic data accurately. The performance of ST-TrafficNet has been evaluated on two real-world benchmark datasets by comparing it with three commonly used methods and seven state-of-the-art ones. In terms of Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE), the proposed method outperforms not only the commonly used methods but also the state-of-the-art ones at the 15 min, 30 min, and 60 min time-steps.


Introduction
With the advance of intelligent transportation systems (ITSs), traffic forecasting has received increasing attention, since accurate traffic forecasting plays a significant role in various ITSs, including traffic signal control systems [1], navigation systems [2], and route guidance systems [3].
The goal of traffic forecasting is to forecast the traffic conditions (e.g., traffic flow and speed) of several future time-steps given the historical traffic data [4]. However, the task is challenging due to the inherent complexity and uncertainty of traffic patterns. As shown in Figure 1, many loop sensors are installed under the roads to collect traffic data by detecting every passing vehicle. The traffic data of the entire traffic network are then sampled as traffic graph signals. Each node of the traffic network graph represents a sensor, and the signals of the node are the traffic data records. On the one hand, the nodes in the same traffic stream are related to each other; for instance, the pattern of upstream node signals will soon appear in the downstream node signals. On the other hand, the continuous node signals exhibit seasonality and trend: weekday patterns are similar to each other but differ from weekend patterns, and the number of vehicles rises year by year. Both the spatial and temporal features are important to traffic forecasting.

Figure 1. Complex spatial and temporal dependencies. The traffic network data are sampled as traffic graph signals, which consist of many sets of traffic node signals, and the nodes are interrelated as a graph. The temporal dependencies of each set of node signals represent the seasonality and trend. The predetermined graph structure could easily go wrong, as with nodes 1 and 2 (false connection) and nodes 2 and 5 (missing connection).
Recently, researchers have proposed integrating a Convolutional Neural Network (CNN) or Graph Convolutional Network (GCN) with a Recurrent Neural Network (RNN) or Long Short-Term Memory (LSTM) network to explore the spatial and temporal relationships inside the traffic flow data simultaneously [5][6][7][8]. However, such methods face two major shortcomings. First, the GCN-based approaches require a predetermined graph structure, which is constructed from human knowledge or a simple algorithm using the distance from node to node [6], and they assume that it reflects the genuine spatial dependencies of the whole traffic network. The predetermined graph structure can, however, be wrong under several conditions. For instance, as shown in Figure 1, nodes 1 and 2 could be connected because of the short distance between them, yet they are not spatially related since they are installed on carriageways of opposite directions; nodes 2 and 5 could have no connection in the graph because of the long distance between them, yet they are in the same traffic stream and highly related. Furthermore, the traffic flow is affected not only by the properties of the road network, such as the number of lanes or the slope of the pavement, but also by the social circumstances near the road segment. The social circumstances include a series of interacting characteristics, such as the economics of the area, the demographic density, and the territorial functions. The economics of the area and the demographics are difficult to define quantitatively in a traffic graph structure. In this regard, a predefined graph structure that only considers the connectivity or the distance between two road segments is not enough to fully describe the complex relationship between them. Therefore, the GCN-based methods easily suffer from incorrect graph structure information. Second, current traffic forecasting methods are ineffective at learning high-dimensional temporal features of intricate traffic graph signals. Traditional RNN-based methods have limited capability to handle long-range sequences due to the gradient explosion problem [7,9]. Although many studies adopt LSTM, an advanced RNN, to deal with the problem, it is still difficult to learn high-dimensional temporal features from traffic data [8].
To address these two shortcomings, we present a spatial-temporal deep learning network (ST-TrafficNet) for traffic forecasting. We propose a Multi-Diffusion Convolution (MDC) block in which three types of diffusion convolution (i.e., forward, backward, and attentive) capture spatial dependencies in parallel. In particular, the Attentive Diffusion Convolution (ADC) introduces the Graph Attention Mechanism (GAM) into the diffusion convolution process to learn graph structure information from traffic graph signals without prior graph structure knowledge. Motivated by a previous study [10], we employ stacked LSTM to harvest high-dimensional temporal features and present a stacked LSTM block to promote temporal learning ability. We further combine the MDC block with the stacked LSTM block to construct the spatial-temporal layer and extract spatial-temporal features end-to-end. With the support of residual connections, multiple spatial-temporal layers are cascaded together to cope with the intricate traffic graph signals efficiently and effectively. We summarize the main contributions of this work as follows:

• We propose attentive diffusion convolution to uncover unseen spatial dependencies from traffic graph signals automatically, and further present the multi-diffusion convolution block to harvest spatial features in various manners. Extensive experiments demonstrate the ability of our MDC to improve the results when the graph structure is false or unknown.

• We construct a novel hybrid deep learning network, ST-TrafficNet, for spatial-temporal traffic forecasting. By adopting residual connections, the holistic ST-TrafficNet captures spatial-temporal features effectively and efficiently with cascaded spatial-temporal layers. The core idea of the spatial-temporal layer is to enable our proposed MDC block to tackle the spatial dependencies of traffic graph signals using the high-dimensional temporal features extracted by the stacked LSTM block.

• We evaluate ST-TrafficNet on two benchmark datasets and compare it with various baseline methods for traffic forecasting. The experiments show that our proposed method achieves state-of-the-art results in terms of three widely used criteria.

Related Works
Traffic flow forecasting is a classical problem that has been studied for decades [11]. Early traffic forecasting studies mainly employ model-driven approaches such as the Autoregressive Integrated Moving Average (ARIMA) [12] and the Kalman Filter (KF) family [13][14][15][16]. Although the model-driven methods have been widely adopted in research and real-world applications [17,18], they fail to deal with complex non-linear traffic node signals because of the stationarity assumption on the time series, which is only satisfied under limited conditions [6]. The data-driven methods, in contrast, employ machine learning approaches to discover the traffic patterns in the historical traffic data automatically. With the use of machine learning algorithms such as Support Vector Regression (SVR) [19,20], Extreme Learning Machine (ELM) [21], and the k-Nearest Neighbors algorithm (kNN) [22,23], the early data-driven methods are capable of handling intricate traffic data in high-dimensional Euclidean space while considering the non-linearity of the data; hence, they achieve remarkable results in traffic forecasting.
Inspired by the great advances of deep learning, recent research has further boosted the performance of data-driven methods by adopting various sequence-to-sequence deep neural networks such as the Deep Belief Network (DBN) [24] and the Stacked Autoencoder (SAE) [25,26]. In particular, data-driven methods based on the Recurrent Neural Network (RNN) have proved practical for harvesting the temporal dependencies of traffic node signals [27][28][29]. However, the spatial correlations are often neglected or barely leveraged by RNN-based methods, and thus they are inefficient in processing the spatial-temporal dependencies of traffic graph signals.
To fully exploit the unique spatial-temporal patterns of traffic network data, researchers further integrate a Convolutional Neural Network (CNN) or Graph Convolutional Network (GCN) into the RNN. Zhang et al. modeled the spatial dependencies as a heatmap image and used a branch of CNN units to extract different spatial properties of crowd traffic [5]. Yao et al. proposed a gated CNN mechanism to capture the potential spatial features and combined them with temporal features captured by Long Short-Term Memory (LSTM) to tackle the spatial-temporal traffic forecasting problem [8]. However, the applications of CNN are limited to grid structures, while the traffic network has the topology of a graph. Compared to CNN methods, GCN methods are more capable of dealing with the traffic network graph structure, since the convolution process is done node by node on the graph [30]. Seo [31]; hence, it can capture spatial-temporal dependencies with a recurrent random walks process on traffic network data [6]. Most recently, the hybrid GCN-RNN models have achieved state-of-the-art performance in traffic forecasting [32]. These methods require predefined graph structures to function well, and such structures rely heavily on the domain knowledge of traffic engineers. Even when the technical conditions of the road pavements are the same, the traffic flow pattern may be totally different depending on the social circumstances, such as the economics and the demography. Furthermore, the methods are ineffective at learning high-dimensional temporal features of intricate traffic signals. In this paper, we propose a spatial-temporal deep learning network to address these two shortcomings.
The rest of this paper is organized as follows. Section 3 introduces the preliminary knowledge of traffic forecasting, graph diffusion convolution, and the graph attention mechanism. Section 4 presents the proposed ST-TrafficNet in detail. Section 5 reports a series of experiments on two benchmark datasets to verify the performance of the proposed method. Section 6 concludes the work and provides some future research directions.

Traffic Forecasting Modeling
Traffic forecasting is a prediction task that forecasts future traffic conditions (e.g., traffic flow, speed) given historical traffic network graph signal observations from a set of sensors in the traffic network. A traffic network can be represented as a graph $G = (V, E)$, where $V$ is the set of nodes and $v \in V$ denotes a sensor on the traffic network, and $E$ is the set of edges and $e \in E$ denotes a road segment. At time-step $t$, the graph signal is observed as $X^{(t)} \in \mathbb{R}^{N \times D}$, where $N$ is the number of nodes and $D$ is the number of traffic parameters (e.g., velocity, volume). Given a graph $G$ and the graph signals of the past $M$ steps, the problem aims to learn a function $f(\cdot)$ to forecast the graph signals of the following $H$ steps:

$$[X^{(t-M+1)}, \dots, X^{(t)}; G] \xrightarrow{f(\cdot)} [X^{(t+1)}, \dots, X^{(t+H)}]$$

When the graph structure is unavailable, a learned function $h(\cdot)$ should map the historical node signals to the future ones without graph structure information as input:

$$[X^{(t-M+1)}, \dots, X^{(t)}] \xrightarrow{h(\cdot)} [X^{(t+1)}, \dots, X^{(t+H)}]$$
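To make the input and output of the forecasting function concrete, the following minimal sketch slices a sequence of graph signals into pairs of M past steps and H future steps. The function name `make_windows` and the toy data are illustrative assumptions and are not taken from the paper.

```python
def make_windows(signals, m, h):
    """Split T graph signals (each an N x D nested list) into
    (past M steps, next H steps) training pairs."""
    pairs = []
    for t in range(m, len(signals) - h + 1):
        history = signals[t - m:t]   # the M most recent observations
        future = signals[t:t + h]    # the H steps to be forecast
        pairs.append((history, future))
    return pairs


# Toy data (assumed): T = 6 time-steps, N = 2 sensors, D = 1 parameter.
signals = [[[float(10 * t + n)] for n in range(2)] for t in range(6)]
pairs = make_windows(signals, m=3, h=2)
print(len(pairs))  # 2
```

With 5-min sampling, the 15 min, 30 min, and 60 min horizons correspond to H = 3, 6, and 12 in this kind of windowing.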

Graph Diffusion Convolution
Graph diffusion convolution was proposed by Li et al. and proved to be effective in capturing the spatial dependencies of graph signals [6]. A diffusion process on graph $G$ is defined as a random walk process with restart probability $\alpha$ and transition matrix $D_O^{-1} W$, where $D_O = \mathrm{diag}(W u)$ denotes the out-degree diagonal matrix and $u \in \mathbb{R}^N$ is a vector of all ones. As a typical Markov process, the diffusion process converges to a stationary distribution $S \in \mathbb{R}^{N \times N}$ after many time-steps. The $i$th row of $S$ represents the spatial dependencies between the node $v_i \in V$ and the others; hence, it captures the latent spatial feature of $G$. A closed form of the diffusion process calculates it as a weighted combination of infinite random walks on the graph [33]:

$$S = \sum_{k=0}^{\infty} \alpha (1 - \alpha)^k (D_O^{-1} W)^k$$

where $k$ is the random walk step. To apply the diffusion process to the task of capturing spatial relationships in graph signals $X \in \mathbb{R}^{N \times D}$ with a function $f$, Li et al. propose diffusion convolution [6], which considers a finite number of steps $K$ and assigns trainable weight parameters:

$$f(X) = \sum_{k=0}^{K-1} P_k (D_O^{-1} W)^k X$$

where $P \in \mathbb{R}^K$ denotes the trainable parameters for each step. In the case of a directed graph, diffusion in one direction only is not enough to discover the latent graph information. Therefore, the diffusion convolution has two directions, forward and backward, and the bidirectional diffusion convolution can be defined as:

$$f(X) = \sum_{k=0}^{K-1} \left( P_{f,k} (D_O^{-1} W)^k + P_{b,k} (D_I^{-1} W)^k \right) X$$

where $D_I^{-1} W$ is the transition matrix of backward diffusion and $P_f$, $P_b$ represent the trainable parameters for each step of forward and backward diffusion convolution, respectively. By employing bidirectional diffusion convolution, spatial dependencies can be captured more flexibly on various kinds of graphs.
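As an illustration of the finite-step bidirectional diffusion convolution described above, the sketch below accumulates weighted random-walk steps over a toy directed chain. It is a minimal pure-Python sketch under stated assumptions: the helper names are hypothetical, the per-step weights are fixed numbers rather than trained parameters, and the backward transition matrix is taken as the row-normalised transpose of W, which is an assumption about how the in-degree normalisation is formed.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def row_normalize(w):
    """Return D^{-1} W, where D = diag(W u) holds the row sums of W."""
    out = []
    for row in w:
        s = sum(row)
        out.append([x / s if s else 0.0 for x in row])
    return out


def diffusion_conv(x, transitions, weights):
    """Sum, over directions and steps k, of weights[dir][k] * T_dir^k X."""
    n, d = len(x), len(x[0])
    result = [[0.0] * d for _ in range(n)]
    for t_mat, p in zip(transitions, weights):
        term = x                        # T^0 X = X
        for p_k in p:
            for i in range(n):
                for j in range(d):
                    result[i][j] += p_k * term[i][j]
            term = matmul(t_mat, term)  # advance one random-walk step
    return result


# Toy directed chain 0 -> 1 -> 2 with one feature per node (assumed data).
w = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
forward = row_normalize(w)
backward = row_normalize(transpose(w))   # assumption: reversed edges
x = [[1.0], [2.0], [4.0]]
p_f, p_b = [1.0, 0.5], [0.0, 0.25]       # K = 2 steps per direction
out = diffusion_conv(x, [forward, backward], [p_f, p_b])
print(out)  # [[2.0], [4.25], [4.5]]
```

Node 1 receives a contribution from node 2 through the forward walk and from node 0 through the backward walk, which is the behaviour a single-direction diffusion would miss on a directed graph.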

Graph Attention Mechanism
Attention mechanisms, especially graph attention mechanisms [34], have been widely applied to various graph modeling domains due to their flexibility and high efficiency in learning spatial dependencies [35][36][37]. The graph attention mechanism is an enhanced self-attention strategy [35] that is applied to a Graph Convolutional Network [30,38]. The most significant improvement of GAM is the way it accumulates the features of adjacent nodes during convolution. We formulate a graph convolution process at the graph level, which includes the standardized sum of the features of neighbor nodes, as:

$$H^{l+1} = \sigma(C H^{l} \phi^{l})$$

where $\sigma$ denotes a non-linear activation function (e.g., ReLU, Sigmoid), $C$ is the matrix of standardized constants induced by the graph structure, $\phi^{l}$ is the trainable weight matrix for node feature transformation in the current layer, and $H^{l}$, $H^{l+1}$ represent the node features of the current and next layer, respectively. GCN requires prior graph structure knowledge to calculate the relationship among nodes, which is often missing in various tasks. Instead of using a standardized constant matrix, GAM uses the attention coefficients matrix to discover the graph structure automatically.

Given the layer input $H \in \mathbb{R}^{N \times D}$, where $N$ is the number of nodes and $D$ denotes the number of features of each node, a shared weight matrix $\phi \in \mathbb{R}^{D \times D}$ is adopted to transform the input into higher-dimensional features:

$$E = a(H\phi, H\phi)$$

where $a$ denotes the self-attention mechanism and $E$ is the original attention coefficients matrix. The normalized attention coefficients matrix is then calculated:

$$A = \mathrm{softmax}(\mathrm{leakyReLU}(E))$$

A leakyReLU activation function is used to eliminate weak attention, then the softmax function is applied to normalize the attention coefficients matrix into an easily comparable form.
Finally, the normalized attention coefficients matrix $A$ is used to update GCN by replacing the standardized constant matrix $C$ [38]:

$$H^{l+1} = \sigma(A H^{l} \phi^{l})$$

Spatial Aware Multi-Diffusion Convolution Block
In our work, we propose multi-diffusion convolution, in which we introduce GAM into diffusion convolution and present attentive diffusion convolution. Compared to the bidirectional diffusion convolution, ADC does not require any predetermined graph structure knowledge but learns it by self-attention, and is thus able to capture hidden spatial dependencies automatically. We first initialize two node embedding vectors $N_h, N_t \in \mathbb{R}^N$, named head node and tail node, respectively. Then we construct a trainable transition matrix by multiplying $N_h$ and $N_t$, and induce an attentive transition matrix $A_{att}$ with GAM:

$$A_{att} = \mathrm{softmax}(\mathrm{leakyReLU}(N_h N_t^{\top}))$$
The transition matrix $N_h N_t^{\top}$ implies the potential spatial dependencies from each head node in $N_h$ to each tail node in $N_t$. After connecting the nodes, a leakyReLU activation function is used to eliminate weak connections, then a softmax function is adopted to normalize the transition matrix and make sure that it converges to a stable status after the training process. In the case of a predetermined graph, both the bidirectional diffusion and ADC are available in MDC. The three kinds of diffusion convolution in MDC are shown in Figure 2. In the forward diffusion convolution process, the signal diffuses from the current node to each $k$th node in the node chain, and every $i$th node shares the same weight while $i \le k$. In the backward diffusion convolution process, the signal diffuses from each node that can reach the current node within $k$ time-steps. In the ADC process, however, no weight is shared. The weight between every two nodes is calculated based on GAM and used only in one node chain. Since the weight is trainable, ADC is able to correct false or missing connections. The MDC combines the different diffusion convolutions to discover the spatial dependencies of various kinds of graph structure flexibly and accurately. In the case of a missing graph structure, we propose using ADC independently to discover the potential spatial relationships.

Figure 2. The three kinds of diffusion convolution in MDC. In (a) and (b), the direction of the node chain denotes the diffusion direction of current node 1, and the nodes with the same color share the same weight in the diffusion process (e.g., nodes 5 and 6 share the same weight with time-step k = 2). In (c), the edges indicate the relationship between two nodes (e.g., the dotted edge <1,4> indicates a weak connection based on the Graph Attention Mechanism (GAM)).
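The construction of the attentive transition matrix from the head and tail node embeddings can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings are fixed toy values rather than trained parameters, the leakyReLU slope of 0.01 is an assumed default, and the softmax is assumed to be applied row by row so that each row is a valid transition distribution.

```python
import math


def leaky_relu(x, slope=0.01):
    # slope is an assumed value; the paper does not state it
    return x if x > 0 else slope * x


def softmax(row):
    m = max(row)                              # subtract max for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_transition(n_head, n_tail):
    """Row-wise softmax(leakyReLU(N_h N_t^T)) for embedding vectors."""
    return [softmax([leaky_relu(h * t) for t in n_tail]) for h in n_head]


# Toy embeddings for N = 3 nodes (assumed values).
n_head = [0.0, 1.0, -1.0]
n_tail = [1.0, 2.0, 3.0]
a_att = attentive_transition(n_head, n_tail)
for row in a_att:
    print([round(v, 3) for v in row])
```

Because every row sums to one, the resulting matrix can take the place of a degree-normalised adjacency matrix in a diffusion step, which is what allows the attentive channel to operate when no predetermined graph is available.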
Next, we further propose an MDC block that can be trained end-to-end through stochastic gradient descent, as illustrated in Figure 3. First, we use a 1 × 1 convolution to simulate a hidden-to-hidden data translation and increase the feature dimensions of the input traffic graph signal features in order to enhance learning ability. Then we perform MDC with three channels: the forward channel, the backward channel, and the attentive channel. Each channel implements diffusion convolution on the high-dimensional traffic graph signal features in parallel. Lastly, we concatenate the results and use a 1 × 1 convolution to simulate a hidden-to-output translation and produce the output MDC features.

Temporal Aware Stacked LSTM Block
We adopt the Long Short-Term Memory recurrent network [39] to capture the temporal trend and seasonality of node signals. As a practical variant of RNN, LSTM controls the flow of information by employing gated units and cell states; hence, it is able to solve the vanishing gradient problem efficiently. A typical LSTM network consists of LSTM cells, which contain an input gate $i$, a forget gate $f$, and an output gate $o$. With the gated mechanism, LSTM cells are capable of adding information to or removing information from the cell state $c$ and generating the layer output $h$. Given the current node signal $x_t$, the LSTM can be represented as four iterating modules:

$$i_t = \sigma(W^{i} [h_{t-1}, x_t]), \quad f_t = \sigma(W^{f} [h_{t-1}, x_t]), \quad o_t = \sigma(W^{o} [h_{t-1}, x_t]), \quad \tilde{c}_t = \tanh(W^{c} [h_{t-1}, x_t])$$

where $\sigma$ denotes the element-wise sigmoid function and $\tanh$ is the element-wise hyperbolic tangent function. $W$ is the weighted transition matrix of the iteration process. After an iteration, the cell state $c_t$ is updated and the layer output $h_t$ is produced:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad h_t = o_t \odot \tanh(c_t)$$

where $\odot$ denotes the element-wise product.

Furthermore, we present a stacked LSTM block, as illustrated in Figure 4 (bottom left), to capture high-dimensional temporal features and discover the temporal dependencies of the traffic graph signals. We first resize the input graph signal's features into individual node signals by employing a 1 × 1 convolution. We then stack multiple LSTM layers to improve the performance of LSTM in capturing temporal features at multiple time scales (e.g., day, week, month) [40]. The second LSTM layer receives the hidden cell states of the previous layer as the node signal input; thus, it can harvest relatively high-dimensional features based on the low-dimensional feature input. A ReLU function follows the stacked LSTMs, and a batch normalizer [41] regularizes the features to eliminate weak temporal multi-scale features. Instead of element dropout, we apply the temporal dropout strategy [42] to zero entire temporal features with a dropout rate $d_\theta$. Finally, we adopt another 1 × 1 convolution to resize the node signals into graph signals for the following spatial-aware block.
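The gate updates and the stacking of LSTM layers described above can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions: bias terms are omitted, the weights are constant toy values produced by a hypothetical `weights` helper, and the 1 × 1 convolutions, ReLU, batch normalization, and temporal dropout of the full block are left out.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(w, z):
    """Matrix-vector product W z for a nested-list matrix W."""
    return [sum(w_ij * z_j for w_ij, z_j in zip(row, z)) for row in w]


def lstm_step(x_t, h_prev, c_prev, w):
    """One LSTM iteration; w maps a gate name to a matrix on [h_prev, x_t]."""
    z = h_prev + x_t                                   # concatenation
    i = [sigmoid(v) for v in affine(w["i"], z)]        # input gate
    f = [sigmoid(v) for v in affine(w["f"], z)]        # forget gate
    o = [sigmoid(v) for v in affine(w["o"], z)]        # output gate
    g = [math.tanh(v) for v in affine(w["c"], z)]      # candidate state
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def weights(hidden, n_in, value):
    """Hypothetical helper: constant gate matrices of shape hidden x (hidden + n_in)."""
    return {g: [[value] * (hidden + n_in) for _ in range(hidden)]
            for g in ("i", "f", "o", "c")}


def stacked_lstm(sequence, layers):
    """Each layer reads the hidden outputs of the layer below it."""
    inputs = sequence
    for w, hidden in layers:
        h, c = [0.0] * hidden, [0.0] * hidden
        outputs = []
        for x_t in inputs:
            h, c = lstm_step(x_t, h, c, w)
            outputs.append(h)
        inputs = outputs
    return inputs


# Two stacked layers on a toy single-node signal (assumed sizes and values).
layers = [(weights(2, 1, 0.1), 2), (weights(1, 2, 0.1), 1)]
sequence = [[0.5], [1.0], [1.5]]
features = stacked_lstm(sequence, layers)
print(len(features), len(features[0]))  # 3 1
```

The second layer never sees the raw node signal; it only sees the hidden outputs of the first layer, which is the sense in which stacking lets higher layers build on lower-level temporal features.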

Framework of Spatial-Temporal Deep Learning Network
We present an end-to-end deep learning framework to tackle the spatial-temporal traffic forecasting problem in Figure 4 (top). The proposed framework consists of a linear function to rescale the input data, stacked spatial-temporal layers with residual connections, and an output layer. A spatial-temporal layer is constructed from an LSTM-based block and an MDC block, shown in Figure 4 (bottom left) and Figure 4 (bottom right), respectively. The residual connection is adopted in the workflow of the LSTM-based block and MDC block to stabilize the learning process as the network goes deep [43] and to harvest multi-scale spatial-temporal features. In the first few layers, the stacked LSTM-based block captures relatively low-dimensional temporal features and has a limited descriptive ability for temporal traffic data; hence, the MDC block can only learn the spatial features of a few temporal traffic patterns. As the network goes deeper, high-dimensional temporal features are harvested by the stacked LSTM, and the MDC block is able to discover potential spatial dependencies of multi-scale temporal traffic patterns.
The features are then passed to an output layer equipped with the attention mechanism [35], which forecasts the future traffic graph signal. Here $\hat{X}$ is the forecast result, $F_t \in \mathbb{R}^{N \times D}$ denotes the spatial-temporal features at time $t$, and $W_o$ represents a trainable weight matrix. In particular, the Frobenius inner product $\psi(\cdot, \cdot)$ of $F_T$ and $F_t$ maps the current features to the historical features according to their similarity. Lastly, we use the MSE loss function to train the end-to-end model against the ground truth traffic graph signals $X$.

Figure 4. ... (2) we extract seasonality and trend features from node signals using a Long Short-Term Memory (LSTM)-based block and resize them into graph signals; (3) we use the MDC block to capture spatial dependencies of the previous graph signals; (4) we combine the two blocks and stack them with residual connections, which enables the model to train deeper and learn multi-scale temporal-spatial features; (5) finally, an output layer equipped with an attention mechanism is used to generate the forecasting result.

Data Preparation
We employ two real-world large-scale benchmark datasets, which are widely used [6,44,45], to verify ST-TrafficNet. These two datasets include not only rush hours and non-rush hours, but also weekdays and weekends. The first dataset is from the highways of Los Angeles County (METR-LA). It contains traffic data from a road network of 207 loop detectors, collected from 1 March 2012 to 30 June 2012; see Figure 5 for more details. The second dataset was collected from the Caltrans Performance Measurement System (PeMS), the largest publicly available database. Three hundred and twenty-five loop detectors in the Bay Area (PeMS-BAY) are selected, and the data were collected from 1 January 2017 to 31 May 2017; see Figure 6 for more details. The vehicle loop detector is a kind of sensor that detects vehicles passing or arriving at a road segment. It is often installed under the pavement. The wire loops are supplied with alternating current at frequencies between 10 kHz and 200 kHz by the electronics unit. In this way, the loop wire behaves as a tuned electrical circuit. When a vehicle passes over the loop or is stopped within the loop, the ferrous body material of the vehicle changes the inductance of the loop wire. This change can be detected to report the occupation of the loop; see [46] for more details. The original data collected from the sensors are sliced with a 5-min time interval window. Traffic flow data aggregated over small time intervals are easily affected by outliers, which decreases the accuracy of the forecasting models. On the other hand, traffic flow data aggregated over overly large time intervals provide little information for forecasting [14]. Since we do not intend to forecast the minute-to-minute fluctuations, the 5-min time interval window is a rational setting.
To initialize the graph structure of the traffic network, a thresholded Gaussian kernel algorithm [47] is used to calculate the correlation between every two nodes and construct the adjacency matrix $W = (w_{ij})$ of the graph by:

$$w_{ij} = \begin{cases} \exp\left(-\dfrac{\mathrm{dist}(v_i, v_j)^2}{\sigma^2}\right), & \mathrm{dist}(v_i, v_j) \le \kappa \\ 0, & \text{otherwise} \end{cases}$$

where $\mathrm{dist}(v_i, v_j)$ denotes the road-segment distance between $v_i$ and $v_j$, $\sigma$ is the standard deviation of the distances, and $\kappa$ is the threshold. The dataset is then Z-score normalized and split into a training set (70%), a validation set (10%), and a test set (20%).
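A minimal sketch of the thresholded Gaussian kernel described above is given below. The function name and the toy distance matrix are illustrative, and two details are assumptions: the standard deviation is computed over all entries of the distance matrix (including the zero diagonal), and the threshold is applied to the distance itself.

```python
import math


def gaussian_adjacency(dist, kappa):
    """w_ij = exp(-dist_ij^2 / sigma^2) if dist_ij <= kappa, else 0."""
    values = [d for row in dist for d in row]
    mean = sum(values) / len(values)
    # sigma^2: variance of all pairwise distances (assumed population form)
    sigma2 = sum((d - mean) ** 2 for d in values) / len(values)
    return [[math.exp(-d * d / sigma2) if d <= kappa else 0.0 for d in row]
            for row in dist]


# Toy symmetric road distances between three sensors (assumed values).
dist = [[0, 1, 4],
        [1, 0, 2],
        [4, 2, 0]]
w = gaussian_adjacency(dist, kappa=2.5)
for row in w:
    print([round(v, 3) for v in row])
```

Sensors 0 and 2 lie farther apart than the threshold, so their weight is zero regardless of whether they share a traffic stream, which is exactly the kind of missing connection that a distance-only graph can produce.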

Experimental Setup and Evaluation Criteria
All the experiments were conducted in a computing environment with dual NVIDIA GeForce RTX 2080 Ti GPUs and an Intel(R) Core(TM) i9-7900X CPU @ 3.30 GHz. Eight spatial-temporal layers were cascaded together to harvest spatial-temporal features. The random walk step $K$ in the MDC process was set to 2, and the time-step of the stacked LSTM was set to 32 and 128 for the two LSTM layers, respectively. The dropout rate of the stacked LSTM was 0.2. ST-TrafficNet was trained with the Adam optimizer [50] for 200 epochs with early stopping. The initial learning rate was 0.001 with a weight decay of $3 \times 10^{-4}$; the batch size was set to 128.
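Three criteria are employed to quantitatively evaluate the traffic forecasting performance: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE), defined as follows:

$$\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} \left| X_t - \hat{X}_t \right|, \quad \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( X_t - \hat{X}_t \right)^2}, \quad \mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{X_t - \hat{X}_t}{X_t} \right|$$

where $X_t$ is the observed graph signal at time-step $t$, $\hat{X}_t$ is the predicted graph signal, and $n$ is the number of evaluated samples.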
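The three criteria can be computed as in the following sketch, which operates on flat lists of observed and predicted values. Skipping zero observations in MAPE is an assumption made here to avoid division by zero; the paper does not state how such entries are handled.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Percentage error averaged over non-zero observations (assumed masking)."""
    terms = [abs((t - p) / t) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(terms) / len(terms)


# Toy observed and predicted speeds (assumed values).
y_true = [10.0, 20.0, 40.0]
y_pred = [12.0, 18.0, 40.0]
print(round(mae(y_true, y_pred), 3))   # 1.333
print(round(rmse(y_true, y_pred), 3))  # 1.633
print(round(mape(y_true, y_pred), 3))  # 10.0
```

RMSE exceeds MAE on the same data because squaring weights the larger residuals more heavily, which is why the two criteria can rank methods differently.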

Performance Comparison
Tables 1 and 2 report the comparison of ST-TrafficNet and 10 baseline methods on the two benchmark datasets for 3, 6, and 12 time-steps, respectively. The experimental results show that ST-TrafficNet surpasses the baseline methods on both benchmark datasets at the different time scales.
We can observe that data-driven methods achieve better performance than the model-driven methods (HA, ARIMA, and LSVR), demonstrating the ability of data-driven methods to learn more comprehensive features by deploying neural network architectures. Compared to the temporal data-driven methods (FNN, F-CLSTM, and WaveNet), the spatial-temporal data-driven methods are capable of using the graph structure of the traffic network to help forecast the graph signal on each node, and therefore outperform all the temporal data-driven methods. ST-TrafficNet is superior to the state-of-the-art spatial-temporal data-driven methods. With respect to the DCRNN method, ST-TrafficNet reduces the mean MAE over the two datasets by 8.15% (15 min), 8.65% (30 min), and 9.4% (60 min). DCRNN is comparable to ST-TrafficNet in terms of neural network structure, since it combines diffusion convolution with a Recurrent Neural Network to forecast the spatial-temporal traffic graph signal. In that respect, ST-TrafficNet is more capable of capturing complicated and subtle spatial-temporal patterns for three reasons: (1) The attentive diffusion convolution in the MDC block provides a more comprehensive understanding of the spatial dependencies than the predefined graph structure based on human knowledge, which is adopted in DCRNN. Consequently, ST-TrafficNet outperforms DCRNN at all three time scales. (2) Compared to the simple Recurrent Neural Network, the temporal-aware LSTM-based blocks in each spatial-temporal layer can tackle high-dimensional temporal features and describe the temporal node signals; hence, ST-TrafficNet is able to deal with both short-term and long-term temporal dependencies. Therefore, the MAE reduction at the 60 min time-step is greater than at 30 min and 15 min. (3) Stacked neural networks are usually kept shallow on account of the gradient explosion issue. ST-TrafficNet, however, stacks spatial-temporal layers with residual connections and stabilizes the performance of each layer.
We further provide a visual comparison between DCRNN and ST-TrafficNet in Figure 7, which shows the forecasting results and the ground truth traffic signals of a node over a day. The node is located in a dense area of PeMS-BAY; hence, the spatial dependencies are significant for improving the forecasting performance. As the comparison shows, the curve of ST-TrafficNet is closer to the ground truth, whereas the curve of DCRNN is smoother while the ground truth changes rapidly. At the peak hour of the day, an impulsive disturbance is created by unexpected factors. DCRNN is affected and its prediction deviates far from the ground truth, but ST-TrafficNet remains stable thanks to the self-learned spatial features.

Efficacy of Multi-Diffusion Convolution Block
We further conduct experiments on ST-TrafficNet with four different MDC block configurations to verify the efficacy of the proposed MDC block. Firstly, we remove the MDC block from ST-TrafficNet to evaluate the network with only temporal features as a benchmark. Secondly, an MDC block consisting of only the attentive channel is used to verify the ability to learn potential spatial features without any prior graph structure knowledge. We then configure the MDC block with the forward and backward channels, which are flexible in capturing spatial dependencies from the predetermined graph. The last configuration adopts a three-channel MDC block, which makes it the proposed ST-TrafficNet. We compare the four different ST-TrafficNets on both the METR-LA and PeMS-BAY datasets with 15 min, 30 min, and 60 min time-steps. The experimental results are shown in Table 3. The three ST-TrafficNets using diffusion convolution surpass the temporal-only ST-TrafficNet by a large margin, indicating the significance of diffusion convolution in capturing spatial features for spatial-temporal traffic forecasting. With the MDC block that employs only the attentive diffusion channel, the attentive-only ST-TrafficNet performs only slightly worse than the forward-backward ST-TrafficNet, showing that attentive diffusion convolution is functional when no graph structure information is available. The proposed ST-TrafficNet, which adds the attentive channel, outperforms the forward-backward ST-TrafficNet, indicating that even if the prior graph structure knowledge is given, the MDC block can still discover useful spatial features.

The computational times of ARIMA and ST-TrafficNet on the METR-LA dataset are listed in Table 4. ARIMA(p, d, q) does not really need to be trained. Instead, the order of the autoregressive (AR) term p, the order of the moving average (MA) term q, and the number of differencing d must be properly selected. The order of the AR term is the number of lags to be used as predictors in the linear regression model. Traffic engineers determine the order of the AR term according to the traffic situation of the road segment, so as to guarantee that the predictors are uncorrelated and independent of each other. The MA term refers to the number of lagged forecast errors that go into the ARIMA model. The number of differencing depends on the complexity of the traffic flow; more than one differencing may be needed. More details of ARIMA can be found in [51]. Our model takes 83.74 seconds per epoch and converges after 50 to 60 epochs. Unlike the parametric models, training a deep learning network requires considerable time to iteratively optimize the parameters in each layer, but the prediction needs only one forward propagation. Thus, the prediction stage is much faster than the training stage. Fortunately, the deep learning network only needs to be trained once, and the training can be conducted off-line. Moreover, both stages can be accelerated by GPUs. Our model takes 274% more computational time than ARIMA, but improves the MAE (averaged over 15/30/60 min) by over 43.19% compared with ARIMA.

Influence of Missing and Incorrect Graph Structure Knowledge
For many data-driven spatial-temporal methods, good-quality traffic graph signals and graph structure are essential. However, a traffic network graph structure derived from human knowledge or a simple distance algorithm may contain false connections (e.g., two close nodes planted on carriageways of opposite directions) or missing connections (e.g., two distant nodes planted on the same traffic stream). Therefore, we conduct experiments to test the performance of ST-TrafficNet with and without attentive diffusion convolution on missing and incorrect graph structure. For the incorrect case, we sample noise from Gaussian distributions whose variance is 5% or 10% of the mean value of the graph structure data, and generate the test data by adding the noise to the original graph structure data. For the missing case, another set of graph structure data is produced by setting entries to zero at a rate of 50% or 100%. Both experiments use PeMS-BAY as the original dataset with 15 min time-steps.
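The two perturbations described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact experimental code: the noise is assumed to be zero-mean and added element-wise to a weighted adjacency matrix, the zeroed entries are assumed to be chosen uniformly at random, and the helper names `add_gaussian_noise` and `drop_entries` are hypothetical.

```python
import numpy as np

def add_gaussian_noise(adj, ratio, rng):
    """Add zero-mean Gaussian noise whose variance is `ratio` times the mean of adj."""
    variance = ratio * adj.mean()
    noise = rng.normal(loc=0.0, scale=np.sqrt(variance), size=adj.shape)
    return adj + noise

def drop_entries(adj, rate, rng):
    """Set a fraction `rate` of the entries to zero (assumed uniform at random)."""
    flat = adj.copy().ravel()           # work on a copy; adj is left untouched
    n_drop = int(round(rate * flat.size))
    idx = rng.choice(flat.size, size=n_drop, replace=False)
    flat[idx] = 0.0
    return flat.reshape(adj.shape)

rng = np.random.default_rng(0)
adj = np.ones((4, 4))                    # hypothetical toy graph structure
noisy_5 = add_gaussian_noise(adj, 0.05, rng)   # "incorrect" structure, 5% noise
missing_50 = drop_entries(adj, 0.5, rng)       # "missing" structure, 50% rate
missing_100 = drop_entries(adj, 1.0, rng)      # no predetermined graph at all
```

At a missing rate of 100% the matrix is entirely zero, which corresponds to the case where no predetermined graph structure is available.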
The MAE scores of the two experiments are presented in Table 5. When the predetermined graph structure data are disturbed by 5% Gaussian noise, the MAE score of the non-attentive ST-TrafficNet degrades by 3%, while that of the proposed ST-TrafficNet remains almost unchanged. Under 10% Gaussian noise, the performance of the non-attentive ST-TrafficNet degrades more rapidly, while the proposed ST-TrafficNet decays at a relatively slower rate owing to its ability to correct false connections. As for the missing data, the non-attentive ST-TrafficNet turns into a temporal-only ST-TrafficNet when the missing rate reaches 100%, whereas the proposed ST-TrafficNet can still use attentive diffusion convolution to learn latent spatial dependencies from the graph signals and deliver higher performance. Consequently, the MAE score of the non-attentive ST-TrafficNet degrades by 51.9%, while the proposed ST-TrafficNet incurs only an 11.9% penalty.
Table 5. The Mean Absolute Error (MAE) score of ST-TrafficNet with or without attentive diffusion convolution on missing and incorrect graph structure. Under the condition of incorrect graph structure with 5% noise, the proposed model remains nearly the same while the non-attentive model degrades by 3%. Under the condition of unavailable predetermined graph structure (i.e., missing rate 100%), the non-attentive model degrades by 51.9% and the proposed model degrades by only 11.9%.
Traffic flow speed is critically important for discriminating the traffic state. Our method can also predict traffic speed if the training data contain traffic speed. Furthermore, our method can be extended to other spatial-temporal applications, such as grid load forecasting.
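The degradation percentages quoted for Table 5 can be read as relative changes in MAE with respect to the unperturbed graph structure. The sketch below shows that arithmetic under this assumption. The helper names are hypothetical, and the numbers in the usage lines are placeholders, not values from Table 5.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error between observed and predicted values."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def relative_degradation(baseline_mae, perturbed_mae):
    """Percentage increase of the MAE relative to the unperturbed baseline."""
    return 100.0 * (perturbed_mae - baseline_mae) / baseline_mae

# Placeholder values for illustration only.
baseline = mae([60.0, 55.0, 50.0], [59.0, 56.0, 52.0])
perturbed = mae([60.0, 55.0, 50.0], [58.0, 57.0, 54.0])
penalty = relative_degradation(baseline, perturbed)
```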

Conclusions
In this paper, we propose a novel spatial-temporal deep learning network for traffic forecasting. The proposed attentive diffusion convolution can automatically capture various spatial dependencies, which are difficult, if not impossible, for traffic engineers to predefine. Equipped with our attentive diffusion convolution and cascaded LSTM block, ST-TrafficNet effectively uncovers the spatial-temporal relations within traffic flow data. Extensive experiments on two public benchmark traffic flow datasets show that the proposed model achieves state-of-the-art performance. Future work will focus on two aspects. First, we are trying to design new deep learning networks that consider not only the connectivity of the traffic graph but also social and economic factors, for accurate and timely traffic flow forecasting. Second, we shall apply our methods to more multi-dimensional time series analysis applications, such as grid load forecasting.