SAX-STGCN: Dynamic Spatio-Temporal Graph Convolutional Networks for Traffic Flow Prediction

Accurate, timely, and reliable traffic flow prediction is essential for intelligent transportation systems, yet it remains difficult because of the complex spatio-temporal correlation of traffic flow. Prediction models based on graph convolutional networks (GCN) have become mainstream in recent years. However, most GCN-based prediction models use only an adjacency matrix to characterize the spatial correlation of traffic flow, ignoring the complex and dynamic relationships between adjacent road network nodes and missing the hidden connections between nodes across the whole road network. This paper proposes the SAX-STGCN network for traffic flow prediction to address these problems. The SAX-STGCN model uses symbolic aggregate approximation (sax) to measure the similarity between the historical data of the nodes, both adjacent and non-adjacent, in the period preceding the prediction. The resulting similarity matrix, defined as the global sax-correlation matrix, replaces the original adjacency matrix composed of 0 and 1; it characterizes the correlation between nodes and captures the implicit spatial relationships in the road network. Then, based on the dynamic global sax-correlation matrix, GCN is used to capture the spatial correlation of traffic flow, and a gated recurrent unit (GRU) is used to capture the temporal correlation. The prediction accuracy of the model is better than that of the baselines, and the model retains long-term prediction ability.


I. INTRODUCTION
With the development of urbanization, the rapid growth of the urban population and of the number of motor vehicles has put tremendous pressure on urban transportation. Urban traffic control systems based on intelligent transportation systems (ITS) have gradually developed from passive to active management. The core of active management is to sense changes in urban traffic and respond to likely traffic conditions in advance through traffic control and guidance. Traffic flow prediction is the first step toward active management and a necessary condition for applications such as traffic control [1], path optimization [2], and vehicle scheduling [3]. It also provides a decision-making basis for traffic control and risk assessment and determines the effectiveness of traffic guidance measures. Therefore, traffic flow prediction has attracted great interest from researchers.
The associate editor coordinating the review of this manuscript and approving it for publication was Gang Mei.
However, traffic flow prediction has always been challenging because traffic flow data exhibit highly complex temporal and spatial dependence. Earlier prediction models mainly predict the future traffic state from the perspective of temporal correlation by learning historical traffic characteristics, which are primarily manifested as proximity, periodicity, and randomness. The spatial correlation of traffic flow is mainly manifested as transitivity and spatial similarity. Traffic conditions on an upstream road are transferred to the downstream road to a certain extent, and the downstream traffic condition also affects the upstream road [4]. At the same time, due to the diffusion of the road network, the traffic condition of a particular section ...
As shown in Figure 2, these six nodes are adjacent in the road network. Node 1, node 3, and node 6 continue to show similar sequence characteristics, while node 2 and node 5 show different sequence characteristics. Node 4, in contrast, behaves opposite to node 2 and node 5 at time 4 but shows similar characteristics at time 5. It is therefore difficult for a fixed adjacency matrix of 0 and 1 to characterize the magnitude of influence and the dynamic correlation between adjacent road network nodes, and equally difficult for it to capture the hidden spatial relationships.
Given the different correlation magnitudes and the dynamic changes of the correlations between nodes, this paper proposes a symbolic aggregation representation spatiotemporal graph convolution model (SAX-STGCN). The model performs symbolic aggregation on the historical traffic flow data from the period preceding the prediction and uses the resulting sequence similarity to reconstruct the adjacency matrix, which characterizes the influence between road network nodes and improves prediction accuracy.
The main contributions of this paper are as follows:
1) For the correlation of adjacent nodes, we propose to use sax (Symbolic Aggregate Approximation) to symbolize the traffic state sequences and then use their similarity to represent the node correlation.
2) For the implicit spatial correlation in the road network, the sax method is also applied to the traffic state sequences of non-adjacent nodes, so that their similarity represents the implicit spatial correlation.
3) Given the dynamic change of the correlation between road network nodes, methods (1) and (2) are combined. The traffic state data from the period preceding the prediction are used to construct a dynamically changing global sax-correlation matrix, which dynamically represents both the spatial correlation of adjacent nodes and the implicit spatial correlation. A graph convolutional neural network that takes the dynamic global sax-correlation matrix as input captures the spatial correlation, and a GRU then captures the temporal characteristics of the traffic state; together they form the SAX-STGCN model, which produces the final prediction result.

II. RELATED WORK
Traffic flow forecasting research has a long history. Forecasting methods are mainly divided into statistical models, machine learning, and deep learning. Statistical models mainly include the historical average method (HA), the auto regressive integrated moving average (ARIMA) [9], and its variants [10], [12]. These models have low computational complexity but rely on the assumption that the data are stationary. Actual traffic flow data often show strong randomness and complex variability, so the results in practical applications are not ideal. Machine learning models, such as support vector regression (SVR) [13] and K-nearest neighbor (KNN) [14], can model more complex data. However, the performance of such models depends on the design of the feature engineering, and their capacity to model nonlinearity is limited. With the rapid progress of deep learning technology, various deep learning models have been used to alleviate these limitations. Early work included the multi-layer perceptron (MLP) [15] and the deep belief network (DBN) [16], which improved prediction accuracy. Subsequently, models such as recurrent neural networks (RNNs), long short-term memory networks (LSTMs) [17], and gated recurrent units (GRUs) [18] emerged and have been widely used for time series tasks such as traffic flow prediction. It is worth mentioning that EnLSTM-WPEO [19] uses ensemble learning of diverse LSTMs and PEO-based NNCT weight integration, which effectively improves the prediction accuracy and alleviates the time lag problem. However, these models consider only the temporal correlation and ignore the spatial correlation of traffic flow data.
VOLUME 10, 2022
From the perspective of spatio-temporal correlation, some scholars have applied convolutional neural networks (CNNs) [20] and deeper residual networks (ResNets) [21] from computer vision to traffic flow prediction. A CNN is usually applied by dividing the area into a grid to build a regional-level spatial matrix of traffic flow, stacking these matrices over time into a spatio-temporal matrix, and predicting with a time series model. In this way, the spatio-temporal correlation of traffic flow is considered comprehensively. Representative work includes STResNet [22], CNN-LSTM [23], FCL-Net [24], etc.
Although CNN can model the spatial correlation of traffic flow to a certain extent, the road network is a typical graph structure. Its spatial topology is often lost under grid processing, so a CNN cannot fully describe the spatio-temporal correlation of the road network. With the rise of graph deep learning, GCN has become the most advanced method in many fields. Because the road network is a natural graph structure, GCN can effectively extract the dependency between road network nodes. Therefore, many GCN-based models have been applied to traffic flow prediction in the past few years, and models such as AST-GCN [25], T-GCN [4], AST-GCN [26], TGC-LSTM [27], and MRA-BGCN [28] have achieved good results. However, in these models, the Laplacian matrix of the graph is strictly defined by GCN [29]. A fixed adjacency matrix composed of 0 and 1 is used, and only the connectivity between nodes is considered. Some scholars apply a thresholded Gaussian kernel to the distance matrix of adjacent road network nodes and construct a distance-based adjacency matrix, such as DNRNN [30], KW-GCN [31], etc. The distance matrix is more expressive than the connectivity matrix. Nevertheless, it is difficult for a fixed matrix to characterize the dynamic spatial correlation of traffic flow.
Therefore, this paper proposes to dynamically construct the graph structure of the road network according to the input traffic flow features instead of the original adjacency matrix. The sax method is used to describe the flow correlation between adjacent nodes. At the same time, to capture the hidden spatial relationship for non-adjacent nodes, the sax method is also used to describe the spatial correlation. A dynamic global sax-correlation matrix is constructed to characterize the spatial correlation better. It can enhance the expressive ability of the model, thereby improving the prediction accuracy.

A. PROBLEM DEFINITION
In this paper, we take the traffic speed prediction as an example to illustrate our work.
Definition 1: Road network $G$. We describe the road network as a graph $G = (V, E)$, where $V = \{v_1, v_2, \cdots, v_N\}$ represents the set of nodes in the road network and $N$ represents the number of nodes. Nodes can be sensors deployed on the road network, road intersections, or road sections. $E$ represents the set of edges: if node $v_i$ is adjacent to node $v_j$, there is an edge $e_{ij}$ between them. $A \in \mathbb{R}^{N \times N}$ represents the adjacency matrix of the road network topology graph. Depending on the task, the adjacency matrix can be binary, where 0 means that there is no connection between two nodes and 1 means that they are connected.
Definition 2: Traffic speed matrix $X \in \mathbb{R}^{T \times N}$. $T$ represents the length of the historical speed sequence, $N$ represents the number of nodes, $X_i^t$ represents the speed of node $i$ at time $t$, $X_i$ represents the historical speed sequence of node $i$, and $X^t$ represents the speeds of all nodes of the road network $G$ at time $t$.
Definition 3: Traffic forecasting task. The essence of traffic prediction is to learn a mapping function $f(\cdot)$ that maps the road structure $G$ and the historical traffic state of the past $P$ time steps to the traffic state of the future $Q$ time steps, as shown in Figure 3 and Eq. 1:
$$[X^{t+1}, \cdots, X^{t+Q}] = f\left(G; (X^{t-P+1}, \cdots, X^{t})\right) \tag{1}$$
There are few studies that use time series similarity to characterize the spatial correlation of the road network. DC-STGCN [32] directly uses the Pearson correlation coefficient to characterize the correlation of traffic flow at road network nodes but does not use a time series representation method. Although its complexity is low, it is easily affected by mutation points and cannot measure the similarity of time series trends well [33]. Traffic flow data are complex, high-dimensional time series. Feature extraction and representation for such data require dimensionality reduction and denoising that preserve the shape and basic information of the sequence as far as possible, in order to reduce the computational cost and improve the efficiency of data processing, mining, and analysis. There are many feature extraction and representation methods [33], [34], [35], among which symbolic aggregate approximation (sax) [36] is considered a typical and effective symbolic representation method.
In this paper, the traffic flow data are processed by sax, and on this basis, the similarity of traffic flow between nodes is calculated to characterize the correlation between nodes. The construction process of the sax adjacency matrix and the global sax-correlation matrix is as follows. Assume that the historical sequence length of the model input is $P$ and the traffic speed matrix is $X \in \mathbb{R}^{P \times N}$.
Step 1: Normalize the historical sequence at each node:
$$\bar{X}_i^t = \frac{X_i^t - \mu_i}{\sigma_i}$$
where $\mu_i$ and $\sigma_i$ represent the mean and standard deviation of the traffic speed at sensor $i$ over the $P$ time periods, respectively, and $\bar{X}$ represents the standardized historical traffic speed matrix.
Step 2: Piecewise aggregate approximation (PAA) is performed on the normalized data, dividing each speed sequence of length $P$ into $W$ segments and yielding the speed matrix $C \in \mathbb{R}^{W \times N}$, as shown in Eq. 5:
$$C_i^h = \frac{W}{P} \sum_{t=\frac{P}{W}(h-1)+1}^{\frac{P}{W}h} \bar{X}_i^t \tag{5}$$
where $C_i^h$ represents the average detected speed in segment $h$ at sensor $i$, with $h = 1, 2, \cdots, W$.
Step 3: Discretize the sequence. Suppose the symbol set has $o$ elements. The Gaussian probability density curve is divided into intervals $[\theta_{k-1}, \theta_k]$, where $k = 1, 2, \cdots, o$, such that the area under the curve in each interval is $1/o$.
Step 4: Symbolize the sequence. Each element of $C$ is mapped to the symbol of the interval it falls in, as shown in Eq. 6, where the symbolized sequence is $\tilde{C}$.
Step 5: Compute the sequence similarity after symbolization to obtain the global sax adjacency matrix, where the distance $\tilde{C}_i^h - \tilde{C}_j^h$ between two symbols can be obtained by querying the symbol set distance table.
A describes the influence of traffic flow between adjacent nodes from the perspective of time series similarity and is used to describe the implicit spatial relationship between non-adjacent nodes, replacing the original 0 and 1-valued adjacency matrix.
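Steps 1-5 can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names and the default values of `w` and `o` are hypothetical, symbols are represented by integer indices, the distance table follows the common sax convention in which equal or neighbouring symbols have distance zero, and the final mapping from distance to similarity, `exp(-d)`, is an assumed choice because the similarity equation is not available in the text.

```python
import math
from statistics import NormalDist, mean, pstdev


def znorm(series):
    """Step 1: z-score normalize one node's history."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def paa(series, w):
    """Step 2: average a length-P series over w equal segments (P divisible by w)."""
    seg = len(series) // w
    return [mean(series[h * seg:(h + 1) * seg]) for h in range(w)]


def breakpoints(o):
    """Step 3: interior cut points splitting N(0, 1) into o equal-area intervals."""
    return [NormalDist().inv_cdf(k / o) for k in range(1, o)]


def symbolize(paa_values, cuts):
    """Step 4: map each PAA value to the index of the interval it falls into."""
    return [sum(v >= c for c in cuts) for v in paa_values]


def symbol_dist(a, b, cuts):
    """Distance-table lookup: zero for equal or neighbouring symbols."""
    if abs(a - b) <= 1:
        return 0.0
    return cuts[max(a, b) - 1] - cuts[min(a, b)]


def sax_distance(x, y, w, o):
    """Lower-bounding sax distance between two raw series of equal length."""
    cuts = breakpoints(o)
    sx = symbolize(paa(znorm(x), w), cuts)
    sy = symbolize(paa(znorm(y), w), cuts)
    total = sum(symbol_dist(a, b, cuts) ** 2 for a, b in zip(sx, sy))
    return math.sqrt(len(x) / w) * math.sqrt(total)


def sax_correlation_matrix(histories, w=4, o=4):
    """Step 5 (assumed form): similarity = exp(-distance) for every node pair."""
    n = len(histories)
    return [[math.exp(-sax_distance(histories[i], histories[j], w, o))
             for j in range(n)] for i in range(n)]
```

Because every pair of nodes is compared, adjacent or not, the resulting matrix is dense; restricting the pairs to adjacent nodes would give the non-global variant.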

C. FRAMEWORK
The model framework of SAX-STGCN is shown in Figure 4. It consists of the sax Process module and the STGCN module. The sax Process module extracts the dynamic global sax-correlation matrix. The ST Block consists of GCN and GRU: GCN extracts the spatial features and GRU extracts the temporal features, which together yield the final prediction result.

D. SPATIAL DEPENDENCE MODELING
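Graph neural network (GNN) is a deep learning method mainly used to model graph-structured data. It learns the relationship between graph structure and node attributes by passing information between nodes and their neighbors. An essential GNN model is the graph convolutional network (GCN), which carries the convolution operation over from Euclidean space to graph structures and can be expressed as Eq. 9-10:
$$H^{(l)} = \sigma\left(L_{sym} H^{(l-1)} \theta^{(l-1)}\right) \tag{9}$$
$$L_{sym} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} \tag{10}$$
where $L_{sym}$ represents the renormalized Laplacian matrix, which effectively prevents numerical instability; $\tilde{A} = A + I_N$ is the adjacency matrix with self-connections and $\tilde{D}$ is its degree matrix; $\sigma(\cdot)$ is the activation function; $H^{(l-1)}$ represents the output of convolution layer $l-1$; and $\theta^{(l-1)}$ represents the parameters of convolution layer $l-1$.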
In this paper, a two-layer graph convolution is used to capture spatial correlation:
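A two-layer graph convolution over a weighted correlation matrix can be sketched as follows. This is an illustrative sketch only: it assumes the renormalization of [29] (adding the identity before symmetric normalization), a ReLU after the first layer, a linear second layer, and non-negative matrix entries so that every row sum is positive. The function names and the parameters `theta0` and `theta1` are hypothetical.

```python
import math


def renormalized_laplacian(a):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square, non-negative matrix A."""
    n = len(a)
    a_tilde = [[a[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    return [[d_inv_sqrt[i] * a_tilde[i][j] * d_inv_sqrt[j] for j in range(n)]
            for i in range(n)]


def matmul(p, q):
    """Plain dense matrix product of two list-of-lists matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def relu(m):
    """Element-wise ReLU."""
    return [[max(0.0, v) for v in row] for row in m]


def two_layer_gcn(x, a, theta0, theta1):
    """x: N x F node features, a: N x N matrix, theta0 / theta1: layer weights."""
    l_sym = renormalized_laplacian(a)
    hidden = relu(matmul(matmul(l_sym, x), theta0))
    return matmul(matmul(l_sym, hidden), theta1)
```

Passing a dense similarity matrix instead of a 0/1 adjacency matrix changes only the weights in `l_sym`; the layer structure stays the same.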

E. TEMPORAL DEPENDENCE MODELING
Gated Recurrent Unit (GRU) [37] is an extension of the RNN. GRU is generally easier to train than LSTM while achieving comparable performance, so it is widely used in time series modeling. Unlike LSTM, GRU does not introduce a separate memory cell but controls how information is retained and forgotten through an update gate $Z_t$ and a reset gate $R_t$. The GRU procedure can be formulated as Eq. 12-15: where $gc(\cdot)$ represents the graph convolution operation, $W$ and $b$ are learnable parameters, and $H$ represents the hidden state.
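The gating mechanism can be illustrated with a scalar toy version of one GRU update. The sketch assumes the standard GRU formulation with the convention that the update gate weights the previous hidden state; it omits the graph convolution $gc(\cdot)$, and the parameter names in the dictionary `p` are hypothetical.

```python
import math


def sigmoid(v):
    """Logistic function used by both gates."""
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x_t, h_prev, p):
    """One GRU update for a scalar input x_t and a scalar hidden state h_prev."""
    z_t = sigmoid(p["w_z"] * x_t + p["u_z"] * h_prev + p["b_z"])  # update gate
    r_t = sigmoid(p["w_r"] * x_t + p["u_r"] * h_prev + p["b_r"])  # reset gate
    # Candidate state: the reset gate scales how much of the past is used.
    h_cand = math.tanh(p["w_h"] * x_t + p["u_h"] * (r_t * h_prev) + p["b_h"])
    # Assumed convention: z_t keeps the old state, (1 - z_t) admits the candidate.
    return z_t * h_prev + (1.0 - z_t) * h_cand


def gru_run(sequence, p, h0=0.0):
    """Apply gru_step along a sequence and return the final hidden state."""
    h = h0
    for x_t in sequence:
        h = gru_step(x_t, h, p)
    return h
```

With all parameters set to zero, both gates equal 0.5 and the candidate is 0, so each step simply halves the hidden state, which makes the role of the update gate easy to verify by hand.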

F. SYMBOL AGGREGATION APPROXIMATION SPATIAL-TEMPORAL GRAPH CONVOLUTION NETWORK
In this paper, the dynamic global sax-correlation matrix is constructed from time series similarity and replaces the original adjacency matrix. The global sax-correlation matrix is used to characterize the implicit spatial relationships in the road network. The specific steps of the SAX-STGCN model are given in Eq. 16-19: where $gc(\cdot)$ represents the graph convolution operation, $W$ and $b$ are learnable parameters, and $H$ represents the hidden state.
Traffic flow prediction aims to make the prediction results as close as possible to the actual traffic flow state. The loss function used in this model is as follows: where $y$ and $\hat{y}$ represent the actual value and the predicted value, respectively, and the value of beta is 1.0.
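The loss equation is not available in the text; the only recoverable detail is the parameter beta = 1.0. One common loss with such a parameter is the smooth L1 loss, which is quadratic for errors below beta and linear above it. The sketch below assumes that form purely for illustration; whether the model uses exactly this loss is an assumption, and the function name is hypothetical.

```python
def smooth_l1(y_true, y_pred, beta=1.0):
    """Mean smooth L1 loss (assumed form): quadratic below beta, linear above."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        diff = abs(y - y_hat)
        if diff < beta:
            total += 0.5 * diff * diff / beta
        else:
            total += diff - 0.5 * beta
    return total / len(y_true)
```

Compared with a purely quadratic loss, this form limits the influence of large errors, for example those caused by abrupt speed changes.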

C. PARAMETER SETTINGS
The experimental platform uses an Intel Xeon(R) Platinum 8259CL CPU @ 2.50GHz and an NVIDIA Tesla T4 GPU. During the experiments, Z-score normalization was applied to the data set, which was divided into an 80% training set and a 20% test set. The initial learning rate of the SAX-STGCN model is 0.001, the weight decay is 5e-5, and the batch size is 8. SAX-STGCN uses the sax method to calculate the similarity of the historical traffic flow data preceding the prediction to obtain a dynamic global sax-correlation matrix. Therefore, the length of the historical traffic flow sequence must be selected in order to construct the dynamic correlation matrix.
In our experiments, the historical sequence length is selected from {4, 8, 12, 16, 20, 24}, and the changes in prediction accuracy are analyzed. As shown in Figure 5, the horizontal axis represents the length of the historical sequence, and the vertical axis represents the RMSE and MAE obtained for each length. Figure 5(a) shows how the accuracy varies with the historical sequence length on the SZ-TAXI dataset. The model performs best when the historical sequence length is 8, and the accuracy decreases as the sequence length increases further, so a sequence length of 8 is chosen. Figure 5(b) shows the variation of accuracy with the historical sequence length on the PEMS-BAY dataset, where the best performance is obtained with a historical sequence length of 16.
Similarly, Figure 6 shows the effect of the number of hidden layer units on the model performance. The horizontal axis represents the number of hidden layer units, and the vertical axis represents the RMSE and MAE values. The number of hidden layer units is selected from {4, 8, 16, 32, 64, 128}. Figures 6(a) and 6(b) show that, on both the SZ-TAXI and PEMS-BAY datasets, the best results are obtained with 16 hidden layer units.
Finally, for the SZ-TAXI dataset, the length of the historical sequence is set to 8 and the number of hidden layer units to 16. For the PEMS-BAY dataset, the length of the historical sequence is set to 16 and the number of hidden layer units to 16.
... flow, and GRU is used to capture the temporal correlation. The number of hidden layer units in the SZ-TAXI data set is set to 100, and the learning rate is set to 0.001. The number of hidden layer units in the PEMS-BAY data set is set to 64, and the learning rate is set to 0.001.
6) Attribute-augmented spatiotemporal graph convolutional network (AST-GCN): It obtains the POI information of nodes in the SZ-TAXI data set, defines historical weather data as external factors, and designs an attribute enhancement unit to encode the external factors and integrate them into the spatiotemporal graph convolution model. The number of hidden layer units of AST-GCN is set to 100, and the learning rate is set to 0.001.
7) Diffusion convolutional recurrent neural network (DCRNN): It combines a recurrent neural network with diffusion convolution to model the inflow and outflow relationships of traffic flow. The SZ-TAXI and PEMS-BAY experiments in this paper use the parameter settings of the original paper.
8) Multi-range attentive bicomponent graph convolutional network (MRA-BGCN): It constructs a node graph according to the distances in the road network, introduces a multi-range attention mechanism, aggregates information over different neighborhood ranges, and automatically learns the importance of the different ranges. The number of BGCGRU layers of MRA-BGCN is set to 2, the number of hidden units is set to 64, the maximum hop of the bicomponent graph convolution is set to 3, the initial learning rate is set to 0.01, and the learning rate decays by a factor of 0.6 every 10 epochs.
In the PEMS-BAY data set, the above characteristics are also shown.
Compared with HA, SVR, and ARIMA, the RMSE of SAX-STGCN at the 15-minute prediction horizon is reduced by about 53.8%, 28.1%, and 21.8%, respectively, and the MAE is reduced by 48.6%, 18.8%, and 32.0%, respectively. Compared with T-GCN, MRA-BGCN, and DCRNN, the MAE is reduced by 7.19%, 5.15%, and 12.5%, respectively. At the longer 60-minute prediction horizon, the corresponding improvements are 12.3%, 3.0%, and 7.2%.
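For reference, the two error metrics and the relative reduction quoted above can be computed as in the following sketch. The function names are hypothetical, and the numbers in the usage line are made-up values for illustration only, not results from the experiments.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error over paired observations."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error over paired observations."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_reduction(baseline_error, model_error):
    """Percentage by which model_error is lower than baseline_error."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Hypothetical error values, for illustration only (not results from the paper).
print(relative_reduction(4.0, 3.0))
```

A reduction is always expressed relative to the baseline error, so the same absolute gain yields a smaller percentage against a weaker baseline.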

E. EXPERIMENTAL RESULTS
Comparing the results on the two datasets, we find that SAX-STGCN achieves the best performance in almost all cases under both RMSE and MAE. SAX-STGCN significantly improves short-term prediction at the 15-minute prediction horizon. As the prediction time increases, the improvement becomes slightly smaller at the 60-minute prediction horizon. In addition, SAX-STGCN shows a larger improvement on the larger PEMS-BAY dataset than on the smaller SZ-TAXI dataset. We speculate that, with more training data, the model can learn more of the changes in the traffic status of the road network, which improves its performance.

F. ABLATION EXPERIMENTS
To verify the contribution of the dynamic global sax-correlation matrix to the model, we design ablation experiments. We divide the matrices used to capture spatial correlations into two groups, dynamic or fixed and global or non-global.
Here, non-global means that sax processing is applied only to adjacent nodes, whose similarity replaces the entries of the original adjacency matrix, while global means that all nodes are processed. The dynamic matrix is the adjacency or correlation matrix used to predict the traffic flow at the next time step; it is obtained by applying sax to the traffic flow data from the period preceding the prediction. The fixed matrix is obtained by applying sax to the data from the first day of the dataset. The PEMS-BAY dataset contains 288 data points per day; taking 16 consecutive points as a group gives 18 groups. A sax-adjacency matrix is computed for each group, and the mean of these matrices is taken as the fixed sax-adjacency matrix. SZ-TAXI contains 96 data points per day, which are divided into 6 groups, and the same operations are repeated.
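The construction of the fixed matrix can be sketched as follows. The sketch assumes that the groups are consecutive, non-overlapping windows and that the per-group matrices are averaged element-wise. The function names are hypothetical, and `toy_matrix` is only a stand-in for the sax matrix computation.

```python
def split_into_groups(day_histories, group_len):
    """Cut each node's one-day series into consecutive windows of group_len steps."""
    n_groups = len(day_histories[0]) // group_len
    return [[series[g * group_len:(g + 1) * group_len] for series in day_histories]
            for g in range(n_groups)]


def fixed_matrix(day_histories, group_len, matrix_fn):
    """Average the per-window matrices returned by matrix_fn into one fixed matrix."""
    groups = split_into_groups(day_histories, group_len)
    mats = [matrix_fn(window) for window in groups]
    n = len(day_histories)
    return [[sum(m[i][j] for m in mats) / len(mats) for j in range(n)]
            for i in range(n)]


def toy_matrix(window):
    """Placeholder for the sax matrix: absolute difference of window means."""
    means = [sum(s) / len(s) for s in window]
    return [[abs(a - b) for b in means] for a in means]
```

With 288 points per day and a group length of 16, `split_into_groups` yields 18 windows; with 96 points it yields 6, matching the group counts stated above.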
In Table 2, it can be seen from the comparison results of the ablation experiments of the two datasets that compared with the adjacency matrix composed of the original 0 and 1, the matrix processed by the sax method has an impact on the performance of the model. Specifically, among the fixed matrices, the global sax-correlation matrix has a small contribution to the model improvement and even reduces the model performance. On the contrary, using the sax-adjacency matrix improves the model's performance. The dynamic matrix can improve the model, whether the dynamic global sax-correlation matrix or the sax-adjacency matrix. The dynamic global correlation matrix can improve the model the most.
In summary, using the sax method to dynamically calculate the similarity of traffic flow and construct the correlation matrix does improve the model. This confirms our hypothesis that the correlation between road network nodes changes dynamically with traffic flow and that the similarity of traffic flow between nodes can represent the correlation of road network nodes. The fixed global sax-correlation matrix, by contrast, degrades the model. We attribute this to the fact that it is obtained by averaging multiple global sax-correlation matrices, so that every pair of road network nodes is assigned a correlation that then remains fixed; this does not correspond to reality and therefore reduces performance. The fixed sax-adjacency matrix, on the other hand, does contribute to model performance, which further shows that the sax method benefits the model.

G. MODEL INTERPRETATION
To better illustrate the prediction performance of the model, a comparative experiment is designed in which the fixed adjacency matrix and the dynamic global sax-correlation matrix are each used in the model, and a road in PEMS-BAY is selected for visual analysis, as shown in Figures 7-9. Here, STGCN denotes the model without sax processing, which uses only a fixed adjacency matrix consisting of 0 and 1. We analyze the results from the following three perspectives: 1) Prediction horizons. The prediction results of traffic speed for the next 15, 30, and 60 minutes, using historical data of the past hour, are shown in Figures 7-9.
The visualization results show that the model with the dynamic sax-correlation matrix fits the actual data better at all prediction horizons. Specifically, the advantage is most prominent in short-term forecasting, especially at the 15-minute horizon, whereas long-term forecasting performance is weaker than short-term forecasting performance. We attribute this to the fact that traffic flow changes dynamically over time, and so does the correlation between traffic nodes. The sax-correlation matrix constructed at the last moment before the prediction therefore characterizes the traffic flow state less and less well as the horizon grows.

V. CONCLUSION
This paper proposes a model for traffic flow prediction called SAX-STGCN. It addresses two limitations: the traditional adjacency matrix characterizes only the connectivity of the roads, and a fixed adjacency matrix has difficulty characterizing the dynamically changing graph structure of traffic flow in the road network. This paper uses the sax method to process the traffic flow from the perspective of time series similarity and obtain the global sax-correlation matrix. This matrix replaces the original adjacency matrix to better characterize node correlations and capture implicit spatial correlations, and it changes dynamically with the traffic flow. In addition, the ablation experiments show the vital role of the sax method in constructing the adjacency matrix. Compared with the baseline methods, SAX-STGCN performs better at different prediction horizons. However, the performance of SAX-STGCN in long-term prediction is not as good as in short-term prediction, and it remains difficult to predict sudden changes in traffic flow. Our future work will introduce more external features, such as weather and significant events. We will use deep learning methods to embed both the traffic flow time series and the external features in order to address these problems.