Traffic Flow Prediction Based on Dynamic Graph Spatial-Temporal Neural Network

More accurate traffic prediction can further improve the efficiency of intelligent transportation systems. However, the complex spatiotemporal correlations in transportation networks pose great challenges. A great deal of research has been carried out to solve this problem. Most studies model the traffic graph with graph neural networks and attempt to use a fixed graph structure to obtain the relationships between nodes. However, because the spatial correlation of the transportation network is time-varying, there is no stable node relationship. To address these issues, we propose a new traffic prediction framework called the Dynamic Graph Spatial-Temporal Neural Network (DGSTN). Unlike models that use predefined graphs, this model represents stable node relationships and time-varying node relationships by constructing static topology graphs and dynamic information graphs during the training and testing stages, to capture hidden node relationships and time-varying spatial correlations. In terms of network architecture, we design multi-scale causal convolution and an adaptive spatial self-attention mechanism to capture temporal and spatial features, respectively, assisted by the static and dynamic graphs. The proposed framework has been tested on two real-world traffic datasets and achieves state-of-the-art performance.


Introduction
Increasing vehicle ownership and travel demand have put huge pressure on traffic management, and effectively optimizing and allocating traffic resources has become a challenge. The emergence of intelligent transportation systems (ITS) offers a promising way to address this challenge. Intelligent transportation systems mainly comprise intelligent transportation infrastructure and data-analysis algorithms, such as computer vision methods for real-time monitoring of the Internet of Vehicles [1], intelligent Internet of Vehicles solutions [2,3], traffic signal control [4], and reinforcement learning for autonomous driving [5,6].
As an essential part of intelligent transportation system technology, traffic forecasting plays a vital role in solving these problems by using historical traffic data to predict future traffic conditions and helping to reduce traffic congestion by realizing effective coordination between passengers, vehicles, and roads.
Early statistical models such as vector autoregression (VAR) [7], the autoregressive integrated moving average (ARIMA) [8], and the historical average (HA) were used for traffic prediction problems, but these models did not perform well in practice. Subsequently, machine-learning models that could model nonlinear traffic data began to emerge, such as support vector machines (SVM) [9] and the k-nearest neighbors algorithm (KNN) [10], but their accuracy was hindered by time-consuming and complex feature engineering. Thanks to the success of deep-learning networks in modeling time series [11,12], a large body of literature has begun to apply deep-learning models to traffic prediction. The most widely used is the recurrent neural network (RNN) [13,14], but it is prone to vanishing gradients when modeling long sequences. For spatiotemporal data, exploring the spatial correlation between data is also very important. The convolutional neural network (CNN) uses convolution to extract hidden information by dividing the research area into grids. However, dividing the road network into grids is very different from the structure of the real road network [15][16][17][18]. Graph neural networks have achieved great success in processing graph topology. The temporal convolutional network (TCN) is an architecture that convolves along the time dimension [19,20], using dilated convolution to achieve an exponentially growing receptive field. However, because of this exponential receptive field, it cannot effectively capture cyclic changes in traffic conditions [21]. Graph convolutional neural networks (GCN) are applicable here because the road network has a non-Euclidean structure whose spatial-temporal correlations GCNs can model [22].
Existing models, while effective at capturing spatial-temporal correlation, still face two major challenges. First, the spatial correlation between actual nodes changes dynamically over time rather than being static and invariant, so the traffic conditions a specific sensor will detect are difficult to capture. As can be seen in Figure 1a, for example, the spatial correlation between nodes A and B is very high in the morning hours, while the correlation between nodes C and D weakens during the afternoon. Second, most models typically use a predefined static graph structure to describe the actual road network; an adjacency matrix constructed using Euclidean distance does not truly reflect the spatial relationships of the road network. For example, Figure 1b shows that nodes C and D are two non-adjacent sensor nodes that are highly correlated spatially.
To solve the above problems, we propose a new framework for the traffic flow problem: the Dynamic Graph Spatial-Temporal Neural Network (DGSTN), which predicts the traffic flow of each sensor in the traffic network. We design a set of adaptive graphs, including static graphs that capture real node correlations in road networks and dynamic graphs that capture dynamic spatial correlations. In addition, temporal characteristics are captured by multi-scale temporal convolutions, and time-varying spatial characteristics are captured by an adaptive spatial self-attention mechanism together with the proposed static topology and dynamic information graphs. Specifically, the main contributions of this paper are as follows:

1. A multi-scale time-gated convolution is proposed to capture temporal features at different granularities, and an improved adaptive spatial self-attention mechanism computes node correlations that reflect real spatial relationships.

2. A set of adaptive graphs is designed: a static topology graph combines an adjacency matrix as prior information with an adaptive embedding matrix to capture real node dependencies, and a set of dynamic information graphs is constructed by capturing the similarity of changes in flow information to obtain dynamic spatial correlations.

3. Results on two real-world datasets show that the framework proposed in this paper achieves the best results compared with various baselines.

Space-Time Traffic Forecast
The prediction of traffic flow is a fundamental problem in intelligent transportation and has been extensively studied, with applications in a wide range of areas. Initial research focused on statistical methods, including VAR [7], ARIMA [8], and HA. These models are underpinned by mathematical theory, but they rely on the assumption of linearity in the prediction task, which is inconsistent with the nonlinear nature of traffic data, leading to poor prediction results. Models based on machine learning, e.g., SVM [9] and KNN [10], may be a good solution to this problem, but good results rely on high-quality manual feature generation, which undoubtedly leads to complex and time-consuming modeling. Models based on deep learning perform well in other domains, automatically extracting features from a given dataset, obviating the need for manual feature generation and alleviating modeling complexity. The success of convolutional neural networks (CNN) in computer vision has led some researchers to use them for traffic prediction [13,14,23]. However, modeling methods that use mesh partitioning for convolutional operations are not, on their own, sufficient to capture the topology of road networks. RNN-based models are well suited to sequence data, and capturing temporal and spatial features by combining Long Short-Term Memory (LSTM) and convolutional neural networks is a widely used practice [14]. However, these methods can only use off-the-shelf solutions without exploring the correlation of different regions, and preserving long-term information for long-sequence problems is challenging for recursive sequence models such as RNN and LSTM.

Graph Convolution
The recent emergence of graph convolutional networks is well suited to traffic prediction tasks, given that traffic road networks are natural graph topologies. Much of the work models either the time and space dimensions separately or the space-time dimension jointly. Since traffic data are correlated in both the time and space dimensions, a common approach is to combine graph convolution with recursive models such as RNNs, replacing the matrix multiplication inside the recurrent cell with convolution on a local spatial-temporal graph [10,18]. Most approaches to capturing spatial dependency use predefined graphs built on Euclidean distances, but in real-world traffic networks two sensors at similar Euclidean distances may not exhibit strong spatial dependency, owing to road-network specifics (intersections, roadway closures, opposite lanes, differing roadway functions). To address this issue, refs. [16,24] make use of adaptively learnable graph structures to capture the real spatial dependencies.

Attention Mechanisms
Attention mechanisms first appeared in models for natural language processing and are now used in a wide range of domains [25,26], providing efficient improvements to many tasks [27,28]. One immediate goal of using attention mechanisms is to score the various dimensions of the input and then weigh the features based on their scores, highlighting the impact that important features have on downstream models or modules. Numerous models in the field of traffic prediction also prove the efficacy of attention, such as [29][30][31].
Attention becomes self-attention when the query, key, and value come from the same input; in sequential tasks it can parallelize processing and consider information from the entire sequence efficiently and quickly. The emergence of multi-head attention [32], which can learn the correlations of different subspaces, gives the self-attention mechanism greater flexibility and ease of modeling compared to CNNs and RNNs.

Problem Formulation
Definition 1. (Road Network). The road network can be viewed as a graph G = (V, E, A), where V = {v_1, . . . , v_N} is the set of N nodes (N = |V|), E is the edge set, and A is the adjacency matrix of the traffic road network G.

Definition 2. (Traffic Flow Tensor). We use X_t ∈ R^{N×F} to represent the traffic flow of the N nodes in the traffic network at time t, where F is the feature dimension. We use X = (X_1, X_2, . . . , X_T) ∈ R^{T×N×F} to represent the traffic flow of all nodes in the road network over T time slices.
The goal of traffic flow forecasting is to predict the future flow of the entire traffic system from given historical flow information. Formally, our goal is to learn a function f that predicts the traffic flow of the next T time steps given the traffic flow observations of the previous T time steps:

[X_{t+1}, . . . , X_{t+T}] = f([X_{t−T+1}, . . . , X_t]). (1)
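The input/output shapes in the formulation above can be sketched in PyTorch; the naive `f` below is only a placeholder for DGSTN, and all sizes are illustrative assumptions:

```python
import torch

# Hypothetical sizes: N sensors, F input features, T historical steps.
N, F, T = 307, 3, 12

# Historical observations X_{t-T+1..t}; f maps them to the next T steps.
X_hist = torch.randn(T, N, F)          # (T, N, F)

def f(x):
    # Placeholder for DGSTN: simply repeat the last observed flow value
    # as a naive forecast of the next T steps.
    last = x[-1, :, :1]                # (N, 1), flow channel only
    return last.unsqueeze(0).expand(T, N, 1)

Y_pred = f(X_hist)                     # (T, N, 1): next T steps
```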

Dynamic Graph Spatial-Temporal Neural Network
We show the framework of DGSTN, which consists of an embedding layer, S-T blocks, and an output layer, in Figure 2a. The input to the model is historical traffic data X = [X_{t−T+1}, . . . , X_t] ∈ R^{T×N×F}, and the output is the predicted traffic flow Ŷ = [X̂_{t+1}, . . . , X̂_{t+T}] ∈ R^{T×N×C} over a future period. The input of each space-time block is H^{(l−1)} ∈ R^{T×N×D}, and the output is H^{(l)}. The details are shown in Figure 2b.

Adaptive Graph
In this section, we describe how to leverage a given information graph G = (V, E, A, X_{t−T:t}) so that the static topology graph and the dynamic information graph can learn from static topology and dynamic traffic information, respectively. Figure 2c shows the specific details of the adaptive graph.

Static Topology Graph
To solve the difficulty of capturing globally valid information using the predefined graphs found in most current models, we propose a static topology graph that learns the adjacency matrix of the optimal graph. It learns the implicit information that cannot be captured by predefined graphs, and then projects the hidden relationships into the predefined adjacency matrix to achieve complementary information. First, the initialization of the predefined adjacency-matrix graph plays an important role in the learning of the adaptive graph, and we define the initialization as

L = D^{−1/2}(A + I)D^{−1/2},

where A is the adjacency matrix, I is the identity matrix, and D ∈ R^{N×N} is the degree matrix with D_ii = ∑_j A_ij. The learnable sparse graph is constructed as

A_1 = ReLU(E_1 E_2^T + Diag(Λ)).

Here, E_1, E_2 ∈ R^{N×F} and Λ ∈ R^N are learnable parameters, N is the number of nodes, and F is the dimension of the learnable parameters, where F ≪ N. Multiplying E_1 and E_2 generates a sparse graph, Diag(Λ) generates the weights of the diagonal positions, and ReLU eliminates weak connections after the two are added. Note that no prior knowledge is required for the sparse matrix A_1; all parameters are learned end-to-end by stochastic gradient descent. An adaptive module is then used to adaptively aggregate the predefined and learnable sparse matrices.
Here, we perform an adaptive aggregation of the learned sparse matrix A_1 and the predefined matrix L, where Sigmoid is a nonlinear activation function, Conv is a 1 × 1 convolutional layer, and ⊙ denotes element-wise multiplication. Since the new matrix is obtained by adaptive aggregation of the predefined graph L and the learnable sparse matrix A_1, it not only retains the prior features of the predefined graph L but also learns node features through training, ensuring faster convergence during iteration.
The normalization operation is performed on the final matrix obtained.
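A minimal PyTorch sketch of the static topology graph follows. The normalization L = D^{−1/2}(A + I)D^{−1/2} and the low-rank learnable graph follow the description above; the exact fusion equation was lost in extraction, so the sigmoid-gate form here is an assumption:

```python
import torch
import torch.nn as nn

class StaticTopologyGraph(nn.Module):
    """Sketch of the static topology graph; the sigmoid-gated fusion of
    the prior L and the learned A_1 is an assumed form."""
    def __init__(self, A, emb_dim=10):
        super().__init__()
        N = A.shape[0]
        # Symmetrically normalized prior: L = D^{-1/2}(A + I)D^{-1/2}
        A_hat = A + torch.eye(N)
        d_inv_sqrt = torch.diag(A_hat.sum(1).pow(-0.5))
        self.register_buffer("L", d_inv_sqrt @ A_hat @ d_inv_sqrt)
        # Low-rank learnable factors, emb_dim << N
        self.E1 = nn.Parameter(torch.randn(N, emb_dim) * 0.1)
        self.E2 = nn.Parameter(torch.randn(N, emb_dim) * 0.1)
        self.lam = nn.Parameter(torch.zeros(N))
        self.conv = nn.Conv2d(1, 1, kernel_size=1)   # 1x1 conv for the gate

    def forward(self):
        # Learnable sparse graph: ReLU prunes weak connections.
        A1 = torch.relu(self.E1 @ self.E2.T + torch.diag(self.lam))
        # Adaptive gate fusing prior L with learned A1 (assumed form).
        g = torch.sigmoid(self.conv(A1[None, None])).squeeze()
        A_s = g * self.L + (1 - g) * A1
        return torch.softmax(A_s, dim=1)             # row-normalize
```

For example, `StaticTopologyGraph(torch.eye(3))()` returns a 3 × 3 matrix whose rows sum to one.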

Dynamic Information Graph
The spatial correlation between different nodes changes over time, so node information must be used to mine these changes. First, for the given node information X ∈ R^{T×N×F}, we lift the feature dimension to D dimensions:

S = FC(X),

where FC(·) is a fully connected network and S represents the node attributes after linear mapping. To capture the dynamically changing spatial correlation over T time steps, we apply one-dimensional dilated convolution along the time dimension:

(S *_d w)(t) = ∑_{k=0}^{K−1} w_k S(t − d·k), (7)

where Equation (7) denotes a one-dimensional dilated convolution with dilation rate d and kernel size K. We can stack multiple dilated convolutional layers and aggregate the time dimension:

M = w * S. (8)

Through Equation (8), we convert S ∈ R^{T×N×D} into M ∈ R^{N×D}, where the overall parameter of the convolution kernel is w ∈ R^{T×D×D}. In our model, cosine similarity is used to calculate the spatial correlation between two nodes, so the relationship between nodes i and j can be expressed as

S_ij = (M_i · M_j) / (‖M_i‖ ‖M_j‖),

where S_ij represents the similarity between node i and node j; the higher the similarity, the stronger the spatial dependence and the higher the spatial correlation. Furthermore, the dynamic spatial graph A_d can be expressed as

A_d = Softmax(ReLU(S)),

where the ReLU activation eliminates negative connections and enhances the nonlinear capability, and the Softmax function normalizes the dynamic information graph.
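The dynamic information graph can be sketched as follows, assuming `fc` is the fully connected mapping and `dilated_convs` is a stack of convolutions that collapses the time dimension (the module shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def dynamic_information_graph(X, fc, dilated_convs):
    """Sketch: X (T, N, F) -> dynamic graph A_d (N, N).
    fc and dilated_convs are assumed placeholder modules."""
    S = fc(X)                                   # (T, N, D)
    h = S.permute(1, 2, 0)                      # (N, D, T) for Conv1d
    for conv in dilated_convs:                  # aggregate the time dim
        h = conv(h)
    M = h.squeeze(-1)                           # (N, D) node summaries
    # Cosine similarity between every pair of node summaries.
    Mn = F.normalize(M, dim=1)
    S_ij = Mn @ Mn.T                            # (N, N)
    return F.softmax(F.relu(S_ij), dim=1)       # prune negatives, normalize
```

A usage sketch with T = 12, D = 16: `dynamic_information_graph(X, nn.Linear(3, 16), [nn.Conv1d(16, 16, kernel_size=12)])`.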

Multi-Scale Gated Time Convolution
Compared with recursive units, convolutional operations do not require sequential computation, which saves a great deal of time. Compared with self-attention mechanisms, which require a large number of parameters, they need only a few, making the model more lightweight. In contrast to the design of TCN, we designed a multi-scale temporal convolution consisting of three GTU gated convolutions with different receptive fields.
The input to the time-gated convolutional network is Z^{(l)} ∈ R^{T×N×D}, where l is the index of the S-T layer. The convolutional kernel is θ ∈ R^{K×D×2D}, the convolution output is θ * Z^{(l)} ∈ R^{N×(T−(K−1))×2D}, and the whole cell can be defined as follows:

GTU(Z^{(l)}) = Tanh(P) ⊙ σ(Q), where [P, Q] = θ * Z^{(l)}.

Here σ stands for the Sigmoid activation function, Tanh for the hyperbolic tangent, and ⊙ for the Hadamard product. We employ a multi-scale gated convolution module to capture the dynamic temporal information of the traffic by adjusting the size of the convolution kernel to obtain different receptive fields, which capture long-term and short-term temporal dependencies. The multi-scale GTU can be represented as follows:

MGTU(Z^{(l)}) = Concat(GTU_{k_1}(Z^{(l)}), GTU_{k_2}(Z^{(l)}), GTU_{k_3}(Z^{(l)})).

Here 1 × k_1, 1 × k_2, and 1 × k_3 are the sizes of the convolution kernels of θ_1, θ_2, and θ_3, respectively. The concatenation fuses the features obtained from the GTUs with three different receptive fields, resulting in a feature with a time length of 3T − (k_1 + k_2 + k_3 − 3).
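A sketch of the GTU and its multi-scale combination, assuming the standard gated-tanh form Tanh(P) ⊙ Sigmoid(Q) and the kernel sizes (3, 5, 8) used in the experiments:

```python
import torch
import torch.nn as nn

class GTU(nn.Module):
    """Gated Tanh Unit: the conv doubles the channels, and one half
    gates the other half."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.conv = nn.Conv2d(channels, 2 * channels,
                              kernel_size=(1, kernel_size))

    def forward(self, z):                         # z: (B, D, N, T)
        p, q = self.conv(z).chunk(2, dim=1)
        return torch.tanh(p) * torch.sigmoid(q)   # (B, D, N, T-k+1)

class MultiScaleGTU(nn.Module):
    """Three GTUs with different receptive fields; the outputs are
    concatenated along the time axis."""
    def __init__(self, channels, kernels=(3, 5, 8)):
        super().__init__()
        self.branches = nn.ModuleList(GTU(channels, k) for k in kernels)

    def forward(self, z):
        outs = [b(z) for b in self.branches]
        # Total time length: 3T - (k1 + k2 + k3 - 3)
        return torch.cat(outs, dim=-1)
```

With T = 12 and kernels (3, 5, 8), the fused time length is 3·12 − (3 + 5 + 8 − 3) = 23, matching the formula above.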

Spatial Attention
When modeling spatial-temporal data, spatial correlations change dynamically at different time steps. The most direct way to capture this feature is to use a fully connected spatial attention mechanism (FSA) to obtain the attention of all nodes at different times.
However, in real road networks, many nodes are not directly connected due to geographical location or weak correlation. To address this issue, we employ an adaptive spatial attention method that captures the dynamic spatial correlation between nodes with realistic relationships. Figure 3 demonstrates the difference between fully connected spatial attention and adaptive spatial attention. Among them, green represents the self-connection of nodes, blue represents fully connected spatial attention, indicating that each node is related to each other, and yellow represents adaptive spatial attention, indicating that each node is adaptively associated.
For fully connected spatial attention, it is first necessary to obtain the query, key, and value vectors of the self-attention mechanism:

Q_t = H_t W_Q^S, K_t = H_t W_K^S, V_t = H_t W_V^S.

Here, W_Q^S, W_K^S, and W_V^S ∈ R^{D×D} are a set of learnable parameters and D is the dimension of the query, key, and value. Next, the dependencies between nodes are computed in the spatial dimension, and the attention scores of all nodes at time t are computed:

A_t^{(S)} = Softmax(Q_t K_t^T / √D).

At different time steps, the attention scores A_t^{(S)} between nodes differ, which dynamically captures the spatial correlation. Further, the output of spatial self-attention is obtained by multiplying the attention score A_t^{(S)} with the value matrix:

Out_t = A_t^{(S)} V_t.

The above formulas describe standard full spatial attention, where all nodes are related to each other. Our adaptive spatial attention designs a mask matrix M that masks out nodes with little spatial correlation at each time step and considers only correlations between nodes with a genuine relationship. When the correlation is below a threshold, the correlation between the two nodes is small, so the attention between them is masked by setting the attention score to −∞. Furthermore, the attention score A_t^{(S)} can be further tuned by multiplying it with the adaptive graph A_adap. Therefore, adaptive spatial attention can be expressed as follows:

ASA_t = (Softmax(Mask(Q_t K_t^T / √D, M)) ⊙ A_adap) V_t.

The symbol ⊙ here represents the Hadamard product. Adaptive spatial attention accomplishes adaptive modeling of spatial correlations between real nodes. Multi-head spatial attention can be expressed as:

MSA_t = Concat(head_1, . . . , head_h) W_O.

By introducing a multi-head spatial-attention mechanism, in which parallel attention heads are concatenated, hidden spatial dependencies can be captured from various subspaces.
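A single-head, single-time-step sketch of the adaptive spatial attention; the masking rule (entries of A_adap at or below a threshold are set to −∞ before the softmax) and the final modulation by A_adap follow the description above, but the exact formulation is an assumption:

```python
import torch
import torch.nn.functional as F

def adaptive_spatial_attention(h, Wq, Wk, Wv, A_adap, threshold=0.0):
    """Sketch of one attention head at one time step.
    h: (N, D) node features; A_adap: (N, N) adaptive graph.
    Pairs whose adaptive-graph weight is at or below `threshold`
    are masked out (assumed masking rule)."""
    D = Wq.shape[1]
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    scores = Q @ K.T / D ** 0.5                       # (N, N)
    mask = A_adap <= threshold
    scores = scores.masked_fill(mask, float("-inf"))  # hide weak pairs
    attn = F.softmax(scores, dim=-1)
    attn = attn * A_adap                              # modulate by A_adap
    return attn @ V                                   # (N, D)
```

Setting `mask` to all-False recovers the fully connected variant; in practice each row of A_adap should keep at least one positive entry (e.g. a self-loop) so the softmax stays well defined.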
Since the multi-head attention mechanism completely discards convolution and recurrence, a mark must be added to each input to represent its temporal and positional relationships. To this end, we design a spatial embedding (SE) to better capture spatial dependencies. The adaptive graph A_adap learned by the graph-learning module is used to initialize the spatial embedding S_E ∈ R^{N×N}, which captures the connectivity and distance relationships between nodes; it is then transformed linearly along the temporal and spatial dimensions to generate S_E ∈ R^{T×N×D}.

Input and Output Layer
The input layer maps the input nodes to a high-dimensional space, using a 1 × 1 convolution to convert the data to X ∈ R^{T×N×D}. To realize multi-step prediction, the output layer uses two 1 × 1 convolutions to convert the hidden dimension into the required dimension X̂ ∈ R^{T×N×C}. The loss function is the mean absolute error between the predicted values Ŷ = [X̂_{t+1}, . . . , X̂_{t+T}] and the ground truth Y = [X_{t+1}, . . . , X_{t+T}]:

L(Ŷ, Y) = (1/T) ∑_{i=1}^{T} |X̂_{t+i} − X_{t+i}|.
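The MAE training objective can be written directly (a sketch; averaging over all steps, nodes, and channels is assumed):

```python
import torch

def mae_loss(y_pred, y_true):
    # Mean absolute error over all future steps, nodes, and channels.
    return (y_pred - y_true).abs().mean()
```

For example, `mae_loss(torch.tensor([1., 2.]), torch.tensor([0., 0.]))` evaluates to 1.5.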

Results and Discussion
In this section, we present the experimental results of the DGSTN and baseline on two spatiotemporal datasets, using multiple evaluation indicators for comprehensive evaluation. During the study, we conducted ablation experiments on the model to analyze the effectiveness of each component and adaptive graph.

Datasets
To evaluate the performance of the proposed model, we conducted experiments on two real-world datasets: PeMS04 and PeMS08. PeMS is a unified database of traffic data collected by the California Department of Transportation (Caltrans) and its partners on California highways, reporting data every 30 s. The traffic flow in both datasets is aggregated every 5 min, giving 288 records per day. Missing values in the two datasets are filled by linear interpolation, and training is made more stable by standardizing the data with z-score normalization, x' = (x − mean(x))/std(x). In forecasting, this paper uses one hour of historical data to predict the next hour; that is, the historical data of 12 time steps is used to predict the data of the next 12 time steps. Both datasets are divided into training, validation, and test sets in chronological order, with a split ratio of 6:2:2. Table 1 summarizes the key information of these two datasets.
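The preprocessing described above (chronological 6:2:2 split plus z-score normalization) can be sketched as follows; normalizing with training-set statistics is a common convention and an assumption here:

```python
import numpy as np

def zscore_split(data, ratios=(0.6, 0.2, 0.2)):
    """Chronological split, then z-score normalization using the
    training-set mean and standard deviation (assumed convention)."""
    T = len(data)
    n_train = int(T * ratios[0])
    n_val = int(T * ratios[1])
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]
    mean, std = train.mean(), train.std()
    norm = lambda x: (x - mean) / std
    return norm(train), norm(val), norm(test), mean, std
```

The returned `mean` and `std` are kept so that predictions can be de-normalized before computing MAE/RMSE/MAPE.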

Baseline Method
We compared the proposed framework with baseline methods, including classical methods and advanced neural network methods.
• HA: Historical average, which predicts future traffic flow as the average of traffic flow data from a past period.
• ASTGCN: An attention-based spatiotemporal graph convolutional network for traffic flow prediction. By stacking attention layers and convolutional layers, it extracts temporal and spatial features from the data.
• STSGCN [18]: Spatiotemporal synchronous graph convolutional network. To capture complex local spatiotemporal correlations more effectively, a spatiotemporal synchronous graph modeling mechanism is proposed.
• GWN [16]: Graph WaveNet for deep spatiotemporal graph modeling. A graph convolutional architecture that proposes an adaptive graph to capture spatial correlations and uses dilated convolution to capture temporal relationships.
• AGCRN [24]: An adaptive graph convolutional recurrent network for traffic flow prediction. It modifies commonly used graph convolutions through node-adaptive parameter learning and adaptive graph-generation modules, and combines graph convolution with GRU to explore spatiotemporal correlations in the data.
• ASTGNN [30]: Learning dynamics and heterogeneity of spatiotemporal graph data for traffic forecasting. This model adopts a self-attention mechanism to capture features in both the temporal and spatial dimensions.

Experiment Settings
All experiments in this paper were conducted on a machine equipped with an NVIDIA GeForce RTX 3060 Ti GPU and 16 GB of RAM. The models were implemented on Windows 11 with PyTorch 1.17 and Python 3.9. Training settings similar to those in [33,34] were used: an Adam optimizer with a learning rate of 0.001, a batch size of 32, and 100 epochs, with an early-stopping strategy to prevent overfitting, both for the baselines and for the model proposed in this paper. The number of S-T layers of DGSTN was set to 3, and the embedding dimension was set to 128. The convolution kernels k_1, k_2, and k_3 of the gated temporal convolution were set to 3, 5, and 8, respectively, and the number of heads of the spatial multi-head attention was set to 8.

Performance Comparison
We used three widely adopted metrics, mean absolute error (MAE), root-mean-square error (RMSE), and mean absolute percentage error (MAPE), to measure the predictive performance of the model, and compared the proposed model with the baselines on the PeMS04 and PeMS08 datasets. Table 2 shows the average performance of the model presented in this paper and the baseline models over the next hour. Our model achieves the best performance at different horizons. (1) First, both our model and the other deep-learning models outperform the traditional methods HA and ARIMA, which shows that deep learning is very effective for time-series forecasting. (2) Compared with LSTM, a deep-learning model that only models the time dimension, our model and the deep-learning models that consider graph structure information are ahead, indicating a strong spatial-temporal dependency in the data. (3) Our model and the self-attention-based ASTGCN and ASTGNN models outperform recurrent networks such as DCRNN and LSTM, suggesting that capturing dynamic spatial-temporal correlations is highly necessary. (4) Our model and GWN, which learn the graph relationships, outperform the graph models DCRNN and STSGCN, indicating that an adjacency graph constructed using only Euclidean distance cannot reflect the real spatial relationships, and that exploring the real node relationships improves model performance. (5) Compared with GWN and AGCRN, which only consider static node relationships, our model also considers dynamic changes in node relationships, learns time-varying spatial characteristics, and demonstrates stronger performance.
(6) Compared with models such as ASTGCN and ASTGNN, which use self-attention mechanisms, our model lets spatial attention interact with the adaptive graph structure proposed in this paper and adaptively selects relevant nodes for the spatial attention calculation; this improves performance and further demonstrates the effectiveness of considering node relationships. (7) Compared with the baselines, our model has a significant lead in long-term prediction. Tables 3 and 4 and Figure 4 show how the prediction performance of the various methods changes as the prediction interval increases on the two datasets. To understand and evaluate the predictive performance more intuitively, we visualized the predictions. We chose sensors No. 7 and No. 157 in the PeMS08 dataset and used the real data for a 24-h day from sensor No. 7 to visualize the predictions of STSGCN, ASTGCN, and the model proposed in this paper. As can be seen from Figure 5a, STSGCN and ASTGCN fit the ground truth less closely, while DGSTN's predictions track the actual traffic flow more accurately. Compared with the other two models, the predictions of DGSTN are more consistent with the actual situation during the periods from 3:00 p.m. to 6:00 p.m. and from 6:00 p.m. to 8:00 p.m. At the same time, while our model is good at capturing the inherent patterns of the time series, it can also effectively avoid over-fitting. For example, in Figure 5b we visualize the actual and predicted traffic flow recorded by sensor No. 157 from 19 August 2016 to 23 August 2016. It can be seen that the sensor generally reaches its daily low at around 2:00 a.m. and its highest peak at around 2:00 p.m. At 2:00 p.m. on 19 August 2016, the traffic flow increased abnormally.
That night it decreased abnormally, resulting in a low at around 2:00 a.m. on 20 August 2016 with lower peaks and valleys than on other days. However, our model did not overfit the abnormal changes in that day's data. Our model achieves impressive predictions overall, although some local predictions may still be affected by random noise.

Ablation Experiment
For this section, we conducted ablation experiments on the adaptive graph structure on the PeMS04 dataset to verify the effectiveness of the model. First, we defined three different graph structures: the static matrix (S), the dynamic information matrix (D), and the adjacency matrix (A), and combined them in different ways. Figure 6 shows the model's average results over one hour and its predictions for each time slice under the different graph combinations. On PeMS04, the combination of the static topology matrix and the dynamic information matrix performs best, while using only the adjacency graph performs worst; this shows that the distance between nodes alone is not enough to judge the strength of their relationship, and that further exploring effective node relationships pays off. Adding dynamic graph information to the adjacency graph greatly improves the results, which shows that constructing a dynamic graph from traffic similarity is effective and further confirms the need to capture the dynamic characteristics of node traffic. The graph structure composed solely of the static graph also improves greatly over the adjacency matrix, demonstrating the value of mining hidden relationships between nodes. The adaptive graph structure composed of the static and dynamic graphs works best, further proving the necessity of capturing both the hidden spatial relationships and the flow characteristics between nodes. In short, each graph design contributes positively to performance.
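The ablation variants can be sketched as assembling different sets of graph supports. The construction below is a hypothetical illustration: the static graph uses a common adaptive-adjacency form (row-softmax of ReLU over node-embedding products) and the dynamic graph uses cosine similarity of recent flows; the paper's exact parameterisation may differ.

```python
import numpy as np

def static_graph(E1, E2):
    """Static topology graph learned from node embeddings
    (one plausible adaptive-adjacency parameterisation)."""
    logits = np.maximum(E1 @ E2.T, 0.0)                      # ReLU
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)              # row softmax

def dynamic_graph(x_recent):
    """Dynamic information graph from traffic similarity:
    cosine similarity of each pair of nodes' recent flow sequences."""
    norm = x_recent / (np.linalg.norm(x_recent, axis=1, keepdims=True) + 1e-8)
    return np.maximum(norm @ norm.T, 0.0)

def supports(A=None, S=None, D=None):
    """Assemble the graph set for one ablation variant:
    A alone, A+D, S alone, S+D, etc., as in Figure 6."""
    return [g for g in (A, S, D) if g is not None]

N, d, T = 6, 4, 12
rng = np.random.default_rng(1)
A = (rng.random((N, N)) < 0.3).astype(float)     # stand-in road adjacency
S = static_graph(rng.normal(size=(N, d)), rng.normal(size=(N, d)))
D = dynamic_graph(rng.normal(size=(N, T)))
assert len(supports(A, S, D)) == 3 and len(supports(A=A)) == 1
```

Each variant in Figure 6 then corresponds to training the same backbone with a different output of `supports`.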

Model Efficiency Study
In this section, we compared the computational efficiency of the models in terms of training time and inference time; the results are shown in Table 5. DGSTN obtains the best computational efficiency. Unlike DCRNN, which uses a recurrent network, DGSTN generates all predictions directly, so it runs faster than DCRNN. STSGCN models the spatial-temporal graph of adjacent time steps and must perform computation at every time step, whereas DGSTN uses adaptive node embeddings to learn the dynamic characteristics of node information and therefore learns faster. ASTGCN and ASTGNN use the self-attention mechanism, which substantially increases computation time. Compared with these two models, DGSTN achieves a lower computational cost thanks to its improved adaptive spatial attention and multi-scale temporal convolution.
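Training and inference times of the kind reported in Table 5 can be measured with simple wall-clock timing. The sketch below is a generic timing helper on a toy stand-in for a forward pass, not the paper's benchmarking code:

```python
import time
import numpy as np

def time_call(fn, *args, repeats=5):
    """Wall-clock a callable, returning mean seconds per call.
    Averaging over several repeats smooths out scheduling jitter."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats

# A matrix product stands in for one inference pass of a model.
W = np.random.default_rng(2).normal(size=(256, 256))
x = np.random.default_rng(3).normal(size=(64, 256))
t = time_call(lambda a, b: a @ b, x, W)
assert t > 0.0
```

Applying the same helper to each model's training step and full-horizon inference gives directly comparable per-epoch and per-batch times.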

Research on the Validity of Static Topology Graph
To visualize the effectiveness of the static topology graph, we selected the top 50 sensors in the PeMS04 dataset as our research focus. Figure 7a shows the sensor-correlation heat map of the adjacency matrix, and Figure 7b shows that of the static topology graph. Comparing the two heat maps, the static topology graph makes many adjustments relative to the adjacency matrix. Because the static topology graph is learned starting from the predefined graph, it preserves some of the latter's basic characteristics; unlike the predefined graph, however, it learns hidden relationships in the road-network structure. For example, it weakens the relationship between sensor 36 and sensor 47 and strengthens the influence of sensor 19 on sensor 36. Sensors 19 and 36 are not directly connected geographically, but adaptive graph learning reveals a strong hidden correlation between them. We visualized the traffic flow curves of sensors 19 and 36 within one day, and the curves show that the two sensors are highly spatially correlated. This indicates that the adjacency matrix cannot express the true node dependencies, since two geographically close sensors may not have a strong dependency, and it further shows that our static topology graph can discover the hidden spatial dependencies in the road network.
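The hidden dependency between non-adjacent sensors (such as 19 and 36) can be checked directly from the data by correlating their flow curves. The sketch below is a hypothetical illustration on synthetic daily curves, not the paper's analysis code:

```python
import numpy as np

def flow_correlation(flows):
    """Pearson correlation between sensors' flow curves.

    flows: array of shape (num_sensors, num_timesteps). Entry [i, j]
    is high when two sensors share a daily pattern even if they are
    not adjacent on the road network -- the kind of hidden dependency
    observed between sensors 19 and 36.
    """
    return np.corrcoef(flows)

# Two sensors with the same daily shape (one scaled and shifted)
# correlate strongly; a sensor with an unrelated pattern does not.
t = np.linspace(0, 2 * np.pi, 288)            # 288 five-minute slots per day
base = 100 + 50 * np.sin(t)
flows = np.stack([base, 0.8 * base + 5, 100 + 50 * np.cos(3 * t)])
C = flow_correlation(flows)
assert C[0, 1] > 0.99 and abs(C[0, 2]) < 0.3
```

Rendering `C` as a heat map for a subset of sensors produces figures of the kind shown in Figure 7.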

Conclusions
This paper proposes a new traffic flow prediction model called DGSTN. DGSTN introduces a set of adaptive graphs: a static topology graph that explores stable spatial correlations and a dynamic information graph that explores dynamic traffic features. By mining the features between nodes, the method characterizes genuine relationships among traffic nodes. The model's spatial-temporal module consists of multi-scale gated convolution and adaptive spatial attention for exploring accurate spatial-temporal correlations. An empirical study on two traffic datasets shows that DGSTN achieves superior performance. The effectiveness of its components is demonstrated by ablation experiments and by visualizing the static topology matrix, which indicate that the model has high potential for exploring fundamental spatial-temporal structures. In addition, the graph structure's flexibility and extensibility make the model innovative, useful, and practical. We use multi-head attention to interact with the graph structure to capture spatial structure. Although this captures correlations over a global graph, multi-head attention relies on dot products, which are computationally expensive, so developing a more lightweight model is a necessary next step. We also plan to apply the proposed framework to other spatial-temporal sequence prediction tasks, such as the evolution of social networks, weather forecasting, and air quality prediction.