Traffic Speed Prediction: An Attention-Based Method

Short-term traffic speed prediction has become one of the most important parts of intelligent transportation systems (ITSs). In recent years, deep learning methods have demonstrated their superiority in both accuracy and efficiency. However, most of them consider only the temporal information, overlooking spatial and environmental factors, especially the different correlations between the target road and the surrounding roads. This paper proposes a traffic speed prediction approach based on temporal clustering and hierarchical attention (TCHA) to address these issues. We apply temporal clustering to the target road to distinguish the traffic environment. Traffic data in each cluster have a similar distribution, which helps improve the prediction accuracy. A hierarchical attention-based mechanism is then used to extract the features at each time step: the encoder measures the importance of spatial features, and the decoder measures the temporal ones. The proposed method is evaluated on data from an area of Hangzhou, and experiments show that it can outperform state-of-the-art methods for traffic speed prediction.


Introduction
Traffic speed prediction, especially short-term prediction (less than 20 min), has become increasingly important in intelligent transportation systems (ITSs) [1]. Many modern traffic facilities and applications rely heavily on prediction accuracy. For example, a navigation system can provide an optimal route for travelers based on real-time prediction and can estimate travel time, which is helpful for making plans. Traffic speed reflects the state of the road network; based on the current traffic speed and its short-term trend, managers can partition the traffic network [2], optimize signal timing, and guide traffic, so as to make full use of road resources and alleviate congestion.
Because of this substantial potential, traffic prediction has become a hot topic in the traffic field over the past few decades. Considering that the current traffic state is relevant to the upstream and downstream roads, and is also similar to the same horizon on previous weekdays and weekends, various data-driven algorithms have been proposed to increase prediction reliability and accuracy. In general, approaches can be divided into three categories: parametric methods, non-parametric methods, and deep learning methods [3][4][5].
Parametric approaches rely on a fixed parameter set, assuming that the collected data follow a similar distribution. The most widely used approach is the auto-regressive integrated moving average (ARIMA) model, a time-series prediction model that assumes the data are stationary, that is, that the mean and variance remain unchanged. The ARIMA model was first used in traffic prediction in [6], and in the past few decades, a number of its extensions have been proposed [7,8]. Although such methods are easy to implement, they can only be applied to linear systems; the traffic system is far more complex, so their prediction accuracy is not reliable enough.
To address the above issues, a traffic speed prediction approach based on temporal clustering and hierarchical attention (TCHA) is proposed in this paper. Firstly, considering the information and the topological structure between the target road and the surrounding roads, the historical data are divided into two parts: spatial data and temporal data. Secondly, to distinguish the traffic environment, temporal clustering analysis is applied to the target road [36], which separates the historical data into several clusters. Traffic data in each cluster have a similar distribution, which can help improve the prediction accuracy. Thirdly, a hierarchical attention-based mechanism is used to measure the importance of each feature at each time step: the spatial attention in the encoder measures the importance of spatial features, and the temporal attention in the decoder measures the temporal ones. In each module, a bi-directional LSTM (BiLSTM) is used to capture further nonlinear information.
The principal contributions of this paper are as follows:

1. A novel deep learning framework is proposed for short-term traffic speed prediction.
2. Temporal clustering is used to improve dataset partition for enhancing performance.
3. Two attention mechanisms are introduced to capture important spatio-temporal information.
4. The effectiveness of the proposed model is validated on two real-world traffic datasets.

Methodology
With the overview of recent studies on short-term traffic speed prediction, a traffic speed prediction approach based on temporal clustering and hierarchical attention (TCHA) is proposed. Figure 1 shows the framework of this paper. We collected raw traffic speed data from cameras equipped on roads, capturing the passing vehicles and saving the information into databases. Data cleaning methods were then employed to remove anomalous elements. The third step was to partition the pre-processed data into several clusters using a hierarchical temporal clustering algorithm. Traffic data in each cluster have a similar distribution, which can help improve the prediction accuracy. Following the above steps, two traffic speed vectors, containing temporal speeds and spatial speeds respectively, were generated, and a hierarchical attention-based method was then applied to these two vectors to capture spatial and temporal features. In the encoder, spatial vectors were taken as inputs, and the relevance of each selected road was determined with the spatial attention. The hidden states computed by the encoder, together with the temporal vectors, were concatenated as inputs for the decoder. In the decoder, the importance of each time step was calculated with the temporal attention. Finally, a fully connected layer was used for prediction. Each part is detailed in the following subsections.

Data Partition
Traffic environments change every day, and some researchers [37] have shown that the traffic environment (or context) dimension is the most relevant to traffic prediction. The context includes the day of the week (weekday or weekend), emergency events (and how far from the target road they happen), weather (rainy, sunny, etc.), and so on. Accuracy may be low if we take all of the pre-processed speed data as training or testing samples. This fact is quite evident: e.g., a model is highly likely to be unable to detect a dog if it is trained on thousands of cats and tens of dogs. However, traffic environments are complex; there is no clear boundary or auto-adjusted model for partitioning them. In this paper, we apply an unsupervised method, temporal clustering (TC), to partition the raw traffic data. Temporal clustering analysis uses hierarchical clustering to obtain several clusters, in which all of the traffic speed data have similar traffic variation patterns. Algorithm 1 illustrates the details of this part.
All of the historical traffic speed data are divided by days before clustering, giving the input dataset sequence D = (d_1, d_2, ..., d_p) with d_i ∈ R^q, where q is the number of records in one day, and p is the number of samples, i.e., the number of initial clusters. The threshold θ and sim_max are first initialized; sim_max is a constant that represents the maximum value of an integer. In each loop, the similarity between clusters is calculated, and the two clusters whose similarity is the maximum are aggregated. The whole procedure stops when there is only one cluster, or when the maximum similarity is less than the threshold. In this paper, the Pearson correlation coefficient [38] is employed as the similarity function. The data of each cluster can be used to train a prediction model. Before prediction, the similarity between the current day's data and each cluster is calculated, and the closest cluster and its model are selected to predict the short-term traffic speed.
Algorithm 1: Temporal clustering analysis.
Input: traffic speed data D divided by days
1. procedure TC(D, θ)
2. initialize threshold as θ, sim_max as INT_MAX
3. initialize sim with empty array
4. while |D| > 1 and sim_max ≥ θ:
5.   for each two clusters i_c, j_c ∈ D:
6.     compute sim(i_c, j_c)
7.   sim_max ← max sim
8.   if sim_max ≥ θ:
9.     merge the two clusters whose similarity is sim_max
10. end while
11. end procedure
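As a concrete illustration, here is a minimal Python sketch of Algorithm 1. The paper does not specify how similarity between multi-day clusters is aggregated, so this sketch assumes the Pearson correlation of the clusters' mean daily profiles; the function names `pearson` and `temporal_clustering` are illustrative, not from the paper.

```python
import math

def pearson(a, b):
    # Pearson correlation coefficient between two equal-length day profiles
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def temporal_clustering(days, theta):
    # Start with one cluster per day; greedily merge the most similar pair
    # until only one cluster remains or the best similarity drops below theta.
    clusters = [[d] for d in days]
    while len(clusters) > 1:
        best, pair = -2.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # cluster similarity: Pearson correlation of mean daily profiles
                pi = [sum(c) / len(c) for c in zip(*clusters[i])]
                pj = [sum(c) / len(c) for c in zip(*clusters[j])]
                s = pearson(pi, pj)
                if s > best:
                    best, pair = s, (i, j)
        if best < theta:
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

With θ = 0.9, two strongly correlated day profiles are merged while an anti-correlated one stays in its own cluster, matching the stopping rule described above.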

The Attention Model
The attention model aims to determine how strongly the target road's speed relates to the preceding time steps and to the surrounding roads. As we know, historical traffic speed data at a closer time step or on a surrounding road will have a greater impact on future speed data [39,40], but the influence may differ in different cases.
There are two attention mechanisms in the model, i.e., the spatial attention and the temporal attention (see Figure 2). The spatial attention is used in the encoder to capture the spatial features and determine the importance of each space point with a BiLSTM network, and the temporal attention is applied in the decoder to capture temporal relations and decide the importance of each time step with another BiLSTM network.


The Encoder Module
The encoder is essentially a BiLSTM network [41], aiming to determine the importance of each space point. Given the input spatial sequences at time step t, S = (S_{t−l+1}, S_{t−l+2}, ..., S_t)^T ∈ R^{l×n}, where n is the number of selected surrounding roads, and l is the number of time lags. The spatial matrix can also be written as S = (S^1, S^2, ..., S^n), where S^i ∈ R^l represents the speed vector of road i at all time steps.
The spatial attention mechanism can be constructed as a soft attention mechanism:

e_t^i = Z_e^T tanh(W_e [h_{e,t−1}; c_{t−1}; S^i] + b_e) + b_{ze}^i,
α_t^i = exp(e_t^i) / Σ_{j=1}^{n} exp(e_t^j),

where [h_{e,t−1}; c_{t−1}; S^i] ∈ R^{2m+l} is a concatenation of the previous encoder hidden state, memory cell, and current spatial data; m is the encoder hidden size; Z_e ∈ R^l and W_e ∈ R^{l×(2m+l)} are the weights of the linear functions; b_e ∈ R^l and b_{ze} ∈ R^n are the bias terms, which are all parameters to learn; c_t and h_{e,t} ∈ R^m are the memory cell and the linear transformation of the hidden state in the encoder procedure, which are initialized as zero tensors and will be illustrated in detail; and tanh(·) is the hyperbolic tangent function. The softmax function computes the spatial attention weights α_t ∈ R^n, which represent the scores of the selected roads, with a higher score representing a stronger relation.
With the attention weights, the input spatial vector at time t can be transferred to

S̃_t = (α_t^1 S_t^1, α_t^2 S_t^2, ..., α_t^n S_t^n)^T,

where S̃_t contains the attention-weighted spatial information.
To extract further features and learn parameters, an activation function should be applied. In this paper, we use BiLSTM, which is specialized for sequence learning. BiLSTM contains a forward LSTM (denoted →LSTM), which processes the spatial data from S̃_{t−l+1} to S̃_t, and a backward LSTM (denoted ←LSTM), which processes the spatial data from S̃_t to S̃_{t−l+1}.
The outputs of the BiLSTM can be expressed as follows:

→h_t = →LSTM(S̃_t, →h_{t−1}),
←h_t = ←LSTM(S̃_t, ←h_{t+1}),
h_t = [→h_t; ←h_t],
h_{e,t} = W_{e,t} h_t + b_{e,t},

where →h_t ∈ R^m is the forward hidden state and ←h_t ∈ R^m is the backward hidden state, both of which capture the deeper information of all inputs at time t; h_t ∈ R^{2m} is the concatenation of →h_t and ←h_t, which represents the encoder hidden state and will be decoded in the temporal attention; h_{e,t} ∈ R^m is the linear transformation of h_t, which will be used for the spatial attention calculation; and W_{e,t} ∈ R^{m×2m} and b_{e,t} ∈ R^m are the weight and bias terms.
With the proposed spatial attention, the encoder will focus more on several roads that obtain higher weights.
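The spatial-attention scoring can be sketched in NumPy under the soft-attention form above. This is a minimal illustration, not the paper's implementation: the function name and the way parameters are passed are assumptions, and the BiLSTM step is omitted.

```python
import numpy as np

def spatial_attention(S, h_prev, c_prev, W_e, Z_e, b_e, b_ze):
    # S: (l, n) lagged speeds, one column per surrounding road
    # h_prev, c_prev: (m,) previous encoder hidden state and memory cell
    n = S.shape[1]
    scores = np.empty(n)
    for i in range(n):
        # concatenate [h_{e,t-1}; c_{t-1}; S^i] -> (2m + l,)
        x = np.concatenate([h_prev, c_prev, S[:, i]])
        scores[i] = Z_e @ np.tanh(W_e @ x + b_e) + b_ze[i]
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()                 # alpha_t in R^n, sums to 1
```

The returned weights are non-negative and sum to one, so the roads with the highest scores dominate the weighted spatial input.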

The Decoder Module
Another BiLSTM network is used in the decoder to determine how strongly each time step influences the predicted traffic speed. With the computed hidden states (h_{t−l+1}, h_{t−l+2}, ..., h_t) from the encoder layer, we calculate the temporal attention weights as follows:

d_t^i = Z_d^T tanh(W_d [h_{d,t−1}; c_{t−1}; h_i] + b_{wd}) + b_{zd}^i,
β_t^i = exp(d_t^i) / Σ_{j=t−l+1}^{t} exp(d_t^j),

where [h_{d,t−1}; c_{t−1}; h_i] ∈ R^{2m+2k} is a concatenation of the previous decoder hidden state, memory cell, and an encoder hidden state; k is the decoder hidden size; and W_d ∈ R^{m×(2m+2k)}, Z_d ∈ R^m, b_{zd} ∈ R^l, and b_{wd} ∈ R^m are the weight and bias terms in the decoder, which are all parameters to learn. The temporal attention weights represent how much each encoder hidden state will influence the prediction results. Since each encoder hidden state contains the spatial factors, a context vector representing the weighted sum of all encoder hidden states can be computed in the attention mechanism:

c_t = Σ_{i=t−l+1}^{t} β_t^i h_i.

Combined with the context vector, the input temporal sequence at time t, (y_{t−l+1}, y_{t−l+2}, ..., y_t), can be transferred to

ỹ_t = W_c^T [c_t; y_t] + b_c,

where [c_t; y_t] ∈ R^{2m+1} is the concatenation of the context vector and the target road's speed at time t; W_c ∈ R^{2m+1} and b_c ∈ R are parameters to learn, which map the concatenation to the new inputs of the decoder. Similarly, we apply a BiLSTM to extract further features in the time dimension. The outputs can be expressed as follows:

→h_t = →LSTM(ỹ_t, →h_{t−1}),
←h_t = ←LSTM(ỹ_t, ←h_{t+1}),
h_t = [→h_t; ←h_t],
h_{d,t} = W_{d,t} h_t + b_{d,t},

where →h_t ∈ R^k is the forward hidden state and ←h_t ∈ R^k is the backward hidden state, both of which capture the deeper information of the entire inputs at time t; h_t ∈ R^{2k} is the concatenation of →h_t and ←h_t; h_{d,t} ∈ R^k is the linear transformation of h_t, which will be used for the temporal attention calculation; and W_{d,t} ∈ R^{k×2k} and b_{d,t} ∈ R^k are the weight and bias terms.
Finally, the prediction result can be iteratively computed through the fully connected layer:

y_{t+1}^{pred} = W_y^T [c_t; h_{d,t}] + b_y,

where [c_t; h_{d,t}] ∈ R^{2m+2k} is the concatenation of the context vector and the decoder hidden state; W_y ∈ R^{2m+2k} and b_y ∈ R are parameters to learn; and y_{t+1}^{pred} represents the predicted result at time step t + 1.
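The temporal-attention step can likewise be sketched in NumPy. As before, this is an illustrative sketch under the soft-attention form described above (function name and parameter passing are assumptions; the decoder BiLSTM is omitted):

```python
import numpy as np

def temporal_attention(H, hd_prev, c_prev, W_d, Z_d, b_wd, b_zd):
    # H: (l, 2m) encoder hidden states h_{t-l+1}..h_t
    # hd_prev, c_prev: (k,) previous decoder hidden state and memory cell
    l = H.shape[0]
    scores = np.empty(l)
    for i in range(l):
        # concatenate [h_{d,t-1}; c_{t-1}; h_i] -> (2m + 2k,)
        x = np.concatenate([hd_prev, c_prev, H[i]])
        scores[i] = Z_d @ np.tanh(W_d @ x + b_wd) + b_zd[i]
    e = np.exp(scores - scores.max())
    beta = e / e.sum()        # temporal weights beta_t, sum to 1
    context = beta @ H        # context vector c_t in R^{2m}
    return beta, context
```

The context vector is the β-weighted sum of encoder hidden states, which is then concatenated with the target road's speed to form the decoder input.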

Model Optimization
As mentioned above, there are parameters to learn from the training samples. The Adam [42] optimizer is used to train the model, and the mean square error in Equation (17) is employed as the loss function to measure the difference between the ground truth values and the predicted results:

L = (1/N) Σ_{i=1}^{N} (y_pred^i − y_true^i)²,    (17)

where y_pred^i is the predicted result, y_true^i is the ground truth value, and N denotes the number of training samples.
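For reference, the MSE loss of Equation (17) and a single Adam update can be written out explicitly. This is a generic sketch of the standard Adam rule [42] applied to a scalar parameter, not the paper's training code:

```python
import numpy as np

def mse_loss(y_pred, y_true):
    # Equation (17): mean squared error over N samples
    return float(np.mean((y_pred - y_true) ** 2))

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam update: exponential moving averages of the gradient and its
    # square, bias-corrected, then a scaled gradient step.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Iterating `adam_step` on the gradient of a simple quadratic drives the parameter toward its minimum, which is all the optimizer is asked to do here at a larger scale.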

Results
In this section, the traffic speed data collected from Xiaoshan District, Hangzhou, China, are used to demonstrate the effectiveness of the proposed TCHA method by comparing it to several state-of-the-art prediction approaches with deep architectures.

Experimental Setup
As shown in Figure 3, to evaluate the performance of the proposed TCHA method, part of the Xiaoshan District was chosen for the experiments (in black); Shixin Road (in red) and Tonghui Road (in green) were selected as the target roads, and the remaining segments were used to determine the spatial features. There were 38 detectors along these roads, and the traffic speed data were collected and aggregated every 5 min. Consequently, one detector preserved 288 records per day. As mentioned before, we partitioned the raw traffic speed data into two vectors; however, due to the malfunction of detectors and failures of data transmission, there were some incorrect data, which include the following:

• Missing data. There are some zero elements in the raw data, which are marked as missing data.
• Outliers. Considering that the speed limits in Hangzhou are usually lower than 80 km/h, we set the maximum traffic speed as 100 km/h, which means that, if a certain speed record is higher than this threshold, it is marked as an outlier.
• Noisy data. Since it is a real-world traffic speed dataset, dramatic changes should be avoided. Consequently, traffic speeds differing by more than 20 km/h between two adjacent time points are considered noisy data.
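The three cleaning rules can be sketched as a single pass over one detector's 5-min series. The function name and exact flagging order are illustrative assumptions; flagged values are replaced with the mean of the previous 10 min (two 5-min records), as described in the text:

```python
def clean_speeds(speeds, v_max=100.0, jump=20.0, window=2):
    # Flag missing (zero), outlier (> v_max km/h), and noisy (> jump km/h
    # change vs. the previous point) records, then replace each flagged
    # value with the mean of up to `window` preceding (cleaned) records.
    cleaned = list(speeds)
    for i, v in enumerate(speeds):
        bad = (v == 0) or (v > v_max) or (i > 0 and abs(v - cleaned[i - 1]) > jump)
        if bad:
            prev = cleaned[max(0, i - window):i]
            if prev:
                cleaned[i] = sum(prev) / len(prev)
    return cleaned
```

For example, a zero record and a 200 km/h record are both replaced by the average of the records just before them, while plausible values pass through unchanged.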

Model Optimization
As mentioned above, there are parameters to learn with training samples. The Adam [42] optimizer is used to train the model, and the mean square error in Equation (17) is employed as a loss function to measure the difference between the ground truth values and the predicted results: where i pred y is the predicted result, i true y is the ground truth value, and N denotes the number of training samples.

Results
In this section, the traffic speed data collected from Xiaoshan District, Hangzhou, China, are used to demonstrate the effectiveness of the proposed TCHA method, through comparing to several state-of-the-art prediction approaches with deep architectures.

The traffic speed data marked as any type of anomaly data are replaced by the average speed of the previous 10 min.
In general, the proportion of anomaly records is less than 10%. Figure 4 plots several typical traffic speed series over time, which demonstrate obvious periodic patterns. The different distributions of traffic speed data in the different clusters obtained by TC can be seen in Figure 5. It is clear that traffic speed data in the same cluster have similar distributions, while the distributions may differ more when the data come from two clusters.
The TCHA method is evaluated on different prediction horizons, and the time lag l is set to 12. Prediction horizons are set up to 5, which means that 60 min of historical traffic speed data are used to predict the speed of the following 25 min. For example, suppose the current time is 7:00 a.m.: the proposed method will predict the speed at 7:05 a.m., 7:10 a.m., 7:15 a.m., 7:20 a.m., and 7:25 a.m.
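The lag/horizon setup above amounts to a sliding-window construction over each cleaned series. A minimal sketch (the function name is an assumption, not from the paper):

```python
def make_windows(series, lag=12, horizon=5):
    # Pair each 60-min history (12 five-minute lags) with the next 5 horizons.
    samples = []
    for t in range(lag, len(series) - horizon + 1):
        x = series[t - lag:t]        # inputs: speeds at t-12 .. t-1
        y = series[t:t + horizon]    # targets: speeds at t .. t+4
        samples.append((x, y))
    return samples
```

With a day of 288 records per detector, this yields 288 − 12 − 5 + 1 = 272 training windows per detector per day.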
The proposed TCHA method is compared to several state-of-the-art prediction approaches, which include the following:
• support vector regression (SVR) [43], which uses a linear support vector machine for regression tasks, especially time-series prediction;
• the stacked autoencoder (SAE) [21], which encodes the inputs into dense or sparse representations by using multi-layer autoencoders;
• long short-term memory (LSTM) [44], which is an extension of recurrent neural networks (RNNs) and has an input gate, a forget gate, and an output gate so as to deal with the long-term dependency and gradient vanishing/explosion problems;
• the gated recurrent unit (GRU) [44], which has an architecture similar to LSTM but only two gates, a reset gate and an update gate, giving GRU fewer tensor operations than LSTM;
• the hierarchical attention model (HA), which uses spatial and temporal attention mechanisms to capture spatial and temporal features respectively, but without applying TC to the input data, which distinguishes it from the proposed TCHA.
To guarantee the fairness of the experiments, all of the approaches are trained with the Adam optimizer, which updates the parameters with a gradient descent algorithm, and the batch size is set to 128. The threshold θ of TC is another hyper-parameter, which is set to 0.6 [36]. All of the neural networks are built in the PyTorch framework. For support vector regression, the radial basis function (RBF) is applied as the kernel function for its better performance in non-linear situations.

Evaluation Criteria
The mean absolute error (MAE), mean relative error (MRE), and root-mean-squared error (RMSE) are employed as the evaluation criteria to assess the prediction accuracy, defined as follows:

MAE = (1/n) Σ_{i=1}^{n} |y_pred^i − y_true^i|,
MRE = (1/n) Σ_{i=1}^{n} |y_pred^i − y_true^i| / y_true^i,
RMSE = sqrt((1/n) Σ_{i=1}^{n} (y_pred^i − y_true^i)²),

where n denotes the number of prediction samples, y_pred^i is the predicted value, and y_true^i is the true value. RMSE and MAE measure the absolute differences between the predicted and true values, and MRE measures the relative ones. With the increase in prediction horizon, the errors become larger, whereas TCHA still reaches the best performance. This implies that long-term prediction is more difficult and challenging, and confirms that extracting spatial and temporal features in the encoder and decoder mechanisms is reasonable.
Figure 6 shows the curves of the predicted results against the ground truth on both roads, with different colors representing different algorithms, over an entire day. The proposed TCHA algorithm fits the true speed data whether it is peak hour or not. In Figure 6b, the difference between the ground truth and the results of the proposed method is relatively large around 15 h, which may be due to the large variation when traffic speed meets the peak hour. Peak hours and emergencies are challenges for short-term prediction. This influence can be seen more clearly when using GRU to make predictions: GRU also fits the general trend of traffic speed over a day, but when the traffic speed undergoes large variations, its prediction performance degrades, which further emphasizes the importance of the spatial and temporal mechanisms and demonstrates that our proposed model approximates the ground truth well.
Figure 7 shows the attention scores learned by the hierarchical attention mechanism described in Section 2.2. A darker color indicates a higher importance when making predictions, while a lighter color indicates a lower importance. Rows represent the prediction points, and the columns in the two subfigures represent the time lag and space point importance, respectively. In the temporal dimension, closer time lags contribute more to the prediction, which matches our intuition that average traffic speed is continuous and will not change dramatically compared to the speed at adjacent times. When the time lag is 1, the temporal score is always the highest, and may reach 0.87 in some cases. Generally, traffic speed at the farthest time lags (more than 50 min) has little or almost no contribution to future prediction. In the spatial dimension, it is harder to find an obvious and significant regularity; the difference between the largest and the lowest attention score is only 0.17. However, the proposed TCHA model still assigns higher weights to the upstream and downstream roads than to the indirect surrounding roads.
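The three criteria can be computed directly from their definitions; a short NumPy sketch (function names are the standard abbreviations, not code from the paper):

```python
import numpy as np

def mae(y_pred, y_true):
    # mean absolute error
    return float(np.mean(np.abs(y_pred - y_true)))

def mre(y_pred, y_true):
    # mean relative error (absolute error scaled by the true value)
    return float(np.mean(np.abs(y_pred - y_true) / y_true))

def rmse(y_pred, y_true):
    # root-mean-squared error
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))
```

Note that MRE is undefined when a true value is zero, which is one practical reason the zero (missing) records are replaced during cleaning before evaluation.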

Conclusions
In this paper, we proposed a traffic speed prediction approach based on temporal clustering and hierarchical attention (TCHA). We first divided the historical traffic speed data into several clusters using the Pearson correlation coefficient as the similarity function. A spatial attention mechanism was then designed in the encoder, which can adaptively select the relatively important segments for prediction, and a temporal attention mechanism was used in the decoder to determine the importance of each time step. The BiLSTM network was used in both