Temporal Backtracking and Multistep Delay of Traffic Speed Series Prediction

Traffic state data form a typical time series, and the length of the data sequence is critical to the accuracy of traffic state prediction. In order to fully explore the causality between traffic data, this study established a temporal backtracking and multistep delay model based on recurrent neural networks (RNNs) to learn and extract the long- and short-term dependencies of the traffic state data. With a real traffic data set, the coordinate descent algorithm was employed to search and determine the optimal backtracking length of the traffic sequence, and multistep delay predictions were performed to demonstrate the relationship between delay steps and prediction accuracies. In addition, the performances of three variants of RNNs (LSTM, GRU, and BiLSTM) were compared with those of 6 frequently used models: decision tree (DT), support vector machine (SVM), k-nearest neighbour (KNN), random forest (RF), gradient boosting decision tree (GBDT), and stacked autoencoder (SAE). The prediction results of 10 consecutive delay steps suggest that the accuracies of RNNs are far superior to those of other models because of their more powerful and accurate pattern representation ability in time series. The results also indicate that RNNs can learn and mine longer time dependencies.


Introduction
As an indispensable means of transportation, automobiles have brought unprecedented convenience to our daily life. At the same time, they have also brought many social problems, such as traffic congestion, energy crisis, and air pollution, which have caused widespread concern around the world [1,2]. Modern intelligent transportation systems (ITS) are the practical (perhaps the only) way to solve these increasingly serious problems [3]. Among them, advanced traveller information systems (ATIS) and advanced traffic management systems (ATMS) are playing a huge role in updating route information, forecasting future traffic conditions, reducing traffic congestion, and improving the overall efficiency of the transportation network [4,5]. As the cornerstone of intelligent transportation systems, traffic flow forecasting is crucial to the development of ITS. Accurate and reliable traffic state prediction can help road users and managers to grasp the real-time traffic status and plan travel routes more reasonably. It also reduces environmental pollution and improves the management of smart cities [6].
Due to the subjectivity and stochastic characteristics of travel demand, the state of traffic flow shows different patterns under the influence of various factors [7], so it is not easy to accurately predict future traffic conditions. In recent years, with the rapid spread of traffic sensors, real-time traffic data have been continuously collected and stored. Traffic data play an important role in assessing traffic conditions and in the application of urban traffic intelligence management [8]. Abundant historical data make it possible to accurately predict future traffic patterns, and this has led to a large number of data-driven traffic prediction studies.
At the same time, traffic flow has a strong time correlation. It repeats the same cycle every day, from the low traffic in the morning to the high traffic during working hours, and finally back to the low traffic in the evening. This periodic feature makes traffic state data a typical time series, in which each data point is tied to a time. Time and sequence characteristics are important attributes that cannot be ignored in traffic state data. By recognizing and utilizing the correlations between traffic patterns and contextual factors [9], the traffic patterns mined from the data sequence will reflect the future of the traffic state.
Among all the time series forecasting methods, deep learning methods generally show better performance. In particular, recurrent neural networks (RNNs) have more advantages than other models in time series analysis because they can memorize their internal state to process input sequences [10]. Long short-term memory (LSTM) neural networks, as the most popular variant of RNNs, have received much attention in recent years [11]. However, the potential of deep learning methods in traffic forecasting has not yet been fully exploited [12]. Although the LSTM can automatically determine the optimal time lags [13], finding the time-lag setting that gives better accuracy still relies largely on trial and error. Obviously, if the data sequence is too short, it is impossible to fully explore the time relationship between the sequences, which directly leads to a decrease in prediction accuracy. Conversely, too many backtracking steps bring more computation without significantly improving the prediction accuracy. How to strike a balance between sequence length and prediction accuracy is one of the concerns of this research.
Another issue studied in this paper is forecasting across multiple time steps. Almost all current research focuses on the prediction of the next time step, and delay prediction that skips multiple time steps is very rare. Using data and models of the current time interval to predict traffic conditions after multiple time steps (e.g., 2 hours or 3 days) is of great significance in practical applications. It can help travellers and managers to predict traffic conditions in advance and take countermeasures as early as possible. However, the more time steps the prediction spans, the larger the prediction error becomes. Therefore, another focus of this paper is to study the relationship between the number of delay steps and the prediction accuracy to meet the needs of practical applications and provide inspiration for future research.
In order to fully explore the temporal relationship between traffic sequences, an LSTM, which has a stronger and more accurate time series pattern learning capability, was established to learn and extract the long-term and short-term time dependencies. At the same time, the coordinate descent algorithm was used to search and determine the backtracking length of the sequential data. Finally, the traffic state predictions with a multistep delay were performed to evaluate the correlation between the delay steps and the prediction accuracy.
Based on the aforementioned discussion, the contributions of this paper are 3-fold: (1) A traffic sequence prediction model with multistep delay was established, and the optimal length of the backtracking sequence was determined by the coordinate descent algorithm. (2) The approximately linear relationship between prediction accuracy and the delay steps was evaluated and demonstrated. (3) It was found that RNNs can learn and represent longer temporal dependence and are very suitable for time series causality mining. In addition, this research is also of great significance to actual traffic management. The optimization of the number of backtracking steps can improve the accuracy and reliability of traffic prediction and avoid wasting computing resources. It can also help ensure the timeliness and reliability of the forecast. Multistep delay prediction can help management use limited sequential data to perceive long-term road network conditions as early as possible and adjust road control plans in time. Long-term forecasts at multiple time steps issued by the management department can also help travellers grasp the dynamic traffic changes during the entire travel period and choose better travel routes to avoid congestion. In particular, during peak hours and emergencies, the ability to provide early warning of traffic conditions across multiple time steps will bring more benefits to traffic management and travel planning.
The remainder of this paper is organized as follows: Section 2 reviews the related works on traffic state prediction. Section 3 introduces the data definitions and methods. Section 4 demonstrates the optimized process of time backtracking and evaluates the performance of multistep delay prediction. Finally, conclusions are drawn with future works in the last section.

Literature Review
With the continuous improvement of artificial intelligence theory, machine learning algorithms are widely used in traffic flow prediction, such as decision tree (DT) [14], support vector machine (SVM) [15], k-nearest neighbour (KNN) [16], random forest (RF) [17], and gradient boosting decision tree (GBDT) [18]. Because of their powerful nonlinear approximation capabilities, machine learning methods are suitable for short-term traffic flow prediction with strong randomness and can obtain good prediction accuracy. Han et al. demonstrated that machine learning methods (especially SVM and neural networks) can clearly outperform two traditional statistical models [19].
However, due to the shallower architecture of machine learning methods and the lack of deeper representation capabilities, it is difficult for most machine learning predictive models to mine deeper abstract relationships between traffic data. Moreover, many prediction algorithms need manual adjustment of parameters. This usually requires experience and skill, which is very time consuming and greatly limits practical applications.
In 2006, Hinton trained a multilayer neural network to reconstruct high-dimensional input vectors and proposed a layered initialization training method to solve the parameter training problem of deep neural networks (DNN) [20]. Since then, a new era of deep learning has started. In essence, deep learning is an expansion of the number of layers of shallow neural networks. The increase in the number of layers gives the deep neural network more nonlinear abstraction and representation capabilities. One study built a short-term traffic flow prediction model based on a CNN deep learning framework to predict traffic speed with a spatiotemporal feature matrix [23] and reported that the prediction result was better than those of the support vector machine and the shallow neural network model. In addition to deep neural network algorithms used alone, there are also ensemble learning methods that combine multiple prediction algorithms in a more flexible structure to obtain better predictive performance. Tang et al. proposed a new method of constructing a fuzzy neural network with the k-means method to forecast travel speed multiple steps ahead based on 2-minute travel speed data [24]. Ma et al. noticed that days with significantly different traffic flow patterns negatively influence forecasting results and proposed an advanced method based on a CNN and LSTM model to select the appropriate predictor for pattern matching [25].
Due to the time series characteristics of traffic flow, the RNN model, with time series processing at its core, has naturally become the most studied and popular deep model. By introducing gate functions into the cell structure, the LSTM handles the problem of long-term dependencies very well; since then, almost all the exciting results based on RNNs have been achieved by the LSTM. The LSTM has become the focus of deep learning [26,27]. Compared with popular time series prediction methods, LSTM can achieve the best prediction performance in accuracy and stability. Tian et al. proposed a learning method based on long short-term memory, which employs multiscale temporal smoothing to infer lost data and the prediction residual [28]. Zhao et al. proposed a traffic prediction model based on long short-term memory networks. It considers temporal-spatial correlation in the traffic system via a two-dimensional network composed of many memory units. A comparison with other representative forecast models validates that the proposed LSTM network can achieve a better performance [29].
Because of the outstanding performance of LSTM in time series, researchers have paid more attention to and developed many improved models based on it, for example, the gated recurrent unit (GRU) model based on simplified LSTM [30,31], the convolutional LSTM (ConvLSTM) model for spatiotemporal feature extraction (combining the learning capabilities of convolutional neural networks and recurrent neural networks) [32,33], and the bidirectional LSTM (BiLSTM) model used to extract the sequential features from forward and backward directions [34,35].
Deep learning prediction models can mine high-dimensional features of data through sequences of semi-affine nonlinear transformations and learn the nonlinear and nonstationary relations between variables. They have also been found to be suitable for big data analysis, with successful applications to computer vision, pattern recognition, speech recognition, natural language processing, and recommendation systems [36]. Deep learning, especially recurrent neural networks, is so successful that more and more ensemble predictive models have been developed based on it. However, few researchers discuss the influence of the number of backtracking steps on the prediction of traffic flow time series or explore multistep delay prediction using existing models. These questions are of great significance for a better understanding of deep neural networks and time series and are exactly what this paper focuses on.

Methodology
In this section, the algorithm framework is introduced first, then the data sequence used is defined, and, finally, the recurrent neural network model and parameter optimization method used are briefly explained.

The Framework of Traffic Flow Time Series Prediction.
The general framework of the traffic flow time series prediction proposed in this paper is shown in Figure 1. The process of prediction consists of two stages: model training and model prediction. In the model training stage, the sequential data of traffic flow are organized into subsequences and fed into the pattern recognition and extraction module for training. These sequences can be consecutive data or data with a certain interval. A unified data format is maintained across the entire training set and test set. The training module continuously evaluates and updates the model. Finally, the RNN model is generated, which can recognize and extract the patterns and features of the traffic flow time series.
In the prediction stage, the prepared backtracking sequence is sent to the traffic flow sequence prediction module, then the trained RNN model is loaded and the traffic state prediction is performed with multiple delay steps, and, finally, the prediction result is output.

Definition of Sequential Data.
A time series is a sequence of numbers in which the values of the same statistical indicator are arranged in the order of their occurrence time. Each value in the time series is the result of successive observations of the same phenomenon at different times and of the joint action of different factors. By statistically analysing the collected time series, we can identify the regularity of data changes and predict future development trends.
In general, traffic state data are collected in a given time interval, such as 5 minutes, 15 minutes, or 1 hour. This time interval is also called the sampling interval. No matter which interval is used, the sampled data will be accompanied by a fixed sampling time. For traffic speed, each data point represents the average speed of vehicles passing by during the sampling interval. Theoretically, $x$ can be used to represent traffic flow time series data for $N$ days, where $x = (x_1, x_2, \ldots, x_N)$, and $x_i$ represents the traffic flow data on the $i$-th day, which is expressed as follows:

$$x_i = (x_i^1, x_i^2, \ldots, x_i^T),$$

where $x_i^t$ is the traffic state data at the $t$-th time point on the $i$-th day, $t \in [1, T]$, and $T$ is the number of data points per day in the current sampling interval, $T \in \mathbb{N}^*$.
As we all know, the closer the data are to the current time, the greater their impact on the future state. Therefore, in the process of traffic state forecasting, it is necessary to go back over the previous data, build a model of the relationship between historical data and the future state, and then use it to predict the future traffic flow. Obviously, a too-long backtracking sequence leads to too much calculation and increases the complexity of the model. However, if the backtracking sequence is too short, the change trends of the traffic series cannot be fully explored, and the accuracy of the forecast is reduced. Therefore, it is necessary to determine how many steps need to be backtracked to obtain accurate forecast results while keeping the computational complexity of the model low. Assuming that the number of backtracking steps is $L_b$, the data sequence used for the prediction is denoted $x_i^t$, the $t$-th data sequence used for prediction on the $i$-th day; $L_b = 1$ means single-step backtracking. By definition, the data sequence used for prediction is a new subsequence composed of $L_b$ data points. Since the prediction uses the data before the current time point $t$, the sequence used for future prediction may contain actual data from the previous day or earlier. So, the time series prediction of traffic state traces back the $L_b$ data points before the current time point (here marked as $t$) to form a new subsequence and uses it to predict the traffic state at a future time point.
Of course, time series forecasting can predict not only the value at the next time point but also a value several time steps ahead. In this paper, the number of steps spanned by the forecast is called the number of delay steps, denoted by $D_s$, where the subscript $s$ is the specific number of delay steps, $s \in \mathbb{N}^*$. Correspondingly, the real value delayed from the current time point $t$ is denoted $y_i^t$; it represents the true value at the $t$-th time point on the $i$-th day and is used to calculate the loss and tune the weights during model training. $D_s$ is the number of delay steps, with $D_s = s$ in numerical value; the other symbols are defined as above. The traffic flow time series prediction can now be restated as constructing a subsequence by backtracking $L_b$ time steps and forecasting the traffic state after $D_s$ delay steps.
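The pairing of a backtracking subsequence with a delayed target can be illustrated with a short sketch. It is not the authors' implementation: the function name `make_windows` and its arguments are hypothetical, and it assumes that each input window holds the $L_b$ most recent observations ending at the current point $t$ and that the target is the observation $D_s$ steps after $t$.

```python
import numpy as np

def make_windows(series, backtrack_steps, delay_steps):
    """Build (inputs, targets) pairs from a 1-D series.

    Assumption: each input window holds the `backtrack_steps` most recent
    observations ending at the current index t (inclusive), and the target
    is the observation `delay_steps` steps after t.
    """
    series = np.asarray(series, dtype=float)
    if backtrack_steps < 1 or delay_steps < 1:
        raise ValueError("backtrack_steps and delay_steps must be >= 1")
    n_samples = len(series) - backtrack_steps - delay_steps + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested window")
    # Window k covers indices k .. k + backtrack_steps - 1.
    inputs = np.stack([series[k:k + backtrack_steps] for k in range(n_samples)])
    # The target of window k sits at index k + backtrack_steps - 1 + delay_steps.
    targets = series[backtrack_steps + delay_steps - 1:]
    return inputs, targets
```

With a 5-minute sampling interval, `make_windows(speeds, 12, 3)` would pair each one-hour window with the speed observed 15 minutes after the window ends, where `speeds` stands for any one-dimensional speed series.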

Long Short-Term Memory Neural Network.
Long short-term memory (LSTM) is an enhancement and improvement of recurrent neural networks. It was first proposed by Hochreiter and Schmidhuber in 1997 to overcome the vanishing and exploding of gradients in the backpropagation stage of recurrent neural networks [37].
LSTM is usually enhanced by a recurrent gate called the forget gate, which replaces the constant error carousel (CEC) weight with multiplication by the forget gate activation, so that the internal memory state can reset itself once its contents are out of date [38]. This "selective forgetting" ability lets the network remember long-term information dependence and ensures that useful information is not gradually diluted by the latest information with the passage of time. This makes the LSTM network suitable for the classification, processing, and prediction of time series data.
An ordinary LSTM unit is composed of a self-connected memory cell and three associated control gates, as shown in Figure 2. The self-recurrent connection updates the state of the memory cell across time steps without external interference. The input gate $i$ and output gate $o$ can allow or prevent information from entering or leaving the recurrent unit. The forget gate $f$ can adjust the memory cell, allowing it to remember or forget its previous state. Over time, these mechanisms enable LSTM to remember or forget certain information [39].
In Figure 2, the input gate and the forget gate use the current input $x_t$ and the output response $h_{t-1}$ of the previous time step to calculate their degree of opening or closing, which determines whether to accept the current input or forget the previous memory. In this calculation, $x_t$ is the input vector at time step $t$; $h_{t-1}$ is the output response of the previous time step; $W$ and $b$ are the weight matrix and bias vector, which are determined during the training process; and $\sigma$ is the activation function, which is usually a sigmoid function in practice. In Figure 2, $C_t$ is the cell state at time step $t$, which allows information to flow along the network without vanishing or exploding. In each time step, the current cell state $C_t$ is generated by adding the previous state $C_{t-1}$, adjusted by the forget gate, and the activation input $g_t$ of the current time step. In this update, • represents the Hadamard (element-wise) product, which multiplies the elements at corresponding positions to obtain a new matrix of the same size; $g$ is the input activation function, which is usually the hyperbolic tangent function; the initial value $C_0$ is set to 0; the remaining symbols are defined as above.
In LSTM, $h_{t-1}$ is also fed back to the input to recursively update the next internal memory state. The output gate uses the current input $x_t$ and the previous output response $h_{t-1}$ to calculate its degree of opening or closing, which determines how much of the newly generated memory information can be output. The output response $h_t$ at each time step is adjusted by the output gate $o_t$, where $h$ is the output activation function, which is usually a hyperbolic tangent function; the initial value $h_0$ is set to 0; the other symbols are the same as above. After the sequential data are fed to the model, the LSTM unit processes them step by step; each time step outputs an internal memory state, and the response output $h_t$ of the last step is the expected output of the whole recurrent network.
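The gate, cell-state, and output computations described above match the standard LSTM formulation with a forget gate. As a sketch under that assumption, the updates can be written as below. The weight subscripts ($W_{x\cdot}$, $W_{h\cdot}$) and the scaling of the activation input $g_t$ by the input gate $i_t$ follow the common convention and are not taken from this text.

```latex
\begin{align}
i_t &= \sigma\left(W_{xi} x_t + W_{hi} h_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} x_t + W_{hf} h_{t-1} + b_f\right) \\
g_t &= g\left(W_{xg} x_t + W_{hg} h_{t-1} + b_g\right) \\
C_t &= f_t \bullet C_{t-1} + i_t \bullet g_t \\
o_t &= \sigma\left(W_{xo} x_t + W_{ho} h_{t-1} + b_o\right) \\
h_t &= o_t \bullet h\left(C_t\right)
\end{align}
```

Here $\sigma$ is the sigmoid function, $g$ and $h$ are usually the hyperbolic tangent, and $C_0 = h_0 = 0$, as stated in the text.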

Optimization of the Backtracking Step.
In the case of limited computing power and resources, the efficiency and performance of the model are particularly critical. Researchers search for effective hyperparameters of the model through different optimization algorithms to find the configuration that gives the best score on the key indicators of the test data set. There are many algorithms for searching and optimizing hyperparameters, which are widely used in various studies. Zhang et al. used a constrained hybrid genetic algorithm (GA) to determine the key parameters of the tire friction coefficient identification magic formula [40]. Xu et al. used a step-wise regression model to determine the coefficients in the study of evaluating bicycle lane safety [41]. Wei et al. used dual coordinate descent to optimize the parameters of support vector machines to minimize the loss on training data [42].
Coordinate descent is a nongradient optimization algorithm. In each iteration, a one-dimensional search is performed along one coordinate direction at the current point to find a local minimum of the function, and the different coordinate directions are used cyclically throughout the process. For nonseparable functions, the algorithm may not find the optimal solution in a small number of iterations. To speed up convergence, an appropriate coordinate system can be adopted; for example, principal component analysis can be used to obtain a new coordinate system whose coordinates are as uncorrelated as possible. Coordinate descent has been widely used in practical research: it is used to calculate the LASSO regression coefficients [43] and in the SMO optimization algorithm for the SVM dual problem [44]. It also performs well as a tuning algorithm in other time series forecasting studies. Amir Mahdi et al. used the coordinate descent method to determine the parameters in the link prediction of the multiplex network [45].
Considering the large amount of data in the data set, it usually takes a long time to complete a search. As the number of parameters increases, the time consumed by the search grows exponentially. Without supercomputing capabilities, it is almost impossible to complete the search process. Therefore, this paper adopts the gradual and greedy coordinate descent algorithm to complete the parameter determination quickly with limited computing power.
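To make the idea of cyclic one-dimensional searches concrete, the following is a minimal sketch of coordinate descent over discrete hyperparameter grids. It is an illustration rather than the procedure used in the paper: the function `coordinate_descent`, the parameter names `backtrack` and `units`, and the toy objective are all hypothetical stand-ins for training a model and measuring its validation error.

```python
def coordinate_descent(objective, grids, start, max_sweeps=10):
    """Cyclic coordinate descent over discrete candidate grids.

    objective: callable taking a dict of parameter values; lower is better.
    grids: dict mapping parameter name -> list of candidate values.
    start: dict giving an initial value for every parameter.
    """
    current = dict(start)
    best = objective(current)
    for _ in range(max_sweeps):
        improved = False
        # One-dimensional search along each coordinate in turn,
        # holding all other parameters fixed at their current values.
        for name, candidates in grids.items():
            for value in candidates:
                if value == current[name]:
                    continue
                trial = dict(current, **{name: value})
                score = objective(trial)
                if score < best:
                    best, current, improved = score, trial, True
        if not improved:
            break  # a full sweep changed nothing, so stop
    return current, best


# Toy objective (hypothetical): stands in for a model's validation error.
def toy_error(params):
    return (params["backtrack"] - 14) ** 2 + (params["units"] - 64) ** 2 / 100


grids = {"backtrack": list(range(2, 21)), "units": [16, 32, 64, 128]}
best_params, best_error = coordinate_descent(
    toy_error, grids, {"backtrack": 2, "units": 16}
)
```

Each sweep evaluates the candidates of one parameter at a time, so the number of evaluations grows with the sum of the grid sizes rather than their product, which is the saving over an exhaustive grid search.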

Data Description.
The proposed model was validated by real-world traffic data from DRIVE Net (Digital Roadway Interactive Visualization and Evaluation Network, http://www.uwdrive.net). The data were collected by the Washington State Department of Transportation (WSDOT) and are maintained by the STAR Lab (Smart Transportation Applications and Research Laboratory). The actual locations of 16 detectors between miles 160 and 170 on the I-5 expressway are shown in Figure 3. For convenience, we used the milepost of the detector to indicate its name (e.g., detector 16,395 represents the detector located at 163.95 miles). The minimum sampling interval of the captured data is 5 minutes.
In the error metrics, $y_t$ is the true value of traffic data at the $t$-th time point; $\hat{y}_t$ is the predicted value at the same time point; and $n$ is the number of samples.
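The three error metrics used in the evaluation (MAPE, MAE, and RMSE) can be sketched with their standard definitions in terms of $y_t$, $\hat{y}_t$, and $n$. The function names are illustrative, MAPE is expressed in percent, and the sketch assumes that no true value is zero.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: mean of |y_t - y_hat_t|."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean square error: sqrt of the mean squared difference."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (true values must be nonzero)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)
```

MAE and RMSE carry the unit of the data (here, speed), while MAPE is scale free, which is why the three are commonly reported together.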

Determination of the Number of Backtracking Steps.
As mentioned in the previous data definition section, the data sequence used for prediction is a new sequence composed of the $L_b$ past data points, where $L_b$ is the number of backtracking steps. When $L_b$ is large, the sequence captures more of the dependence between data points, but the amount of calculation and the time consumption also increase exponentially. When $L_b$ is small, the input sequence becomes shorter and the causal relationship between data points is incomplete, which makes it difficult to extract and results in poor accuracy. As a hyperparameter, the number of backtracking steps must be searched and determined based on the distribution characteristics of the actual data to strike a balance between model accuracy and computational cost. The first parameter in the model that needs to be determined is the number of backtracking steps. Before optimization, the other model parameters are fixed, and model pretraining and evaluation are carried out only over the range of backtracking steps. If the evaluation result improves, the search continues in the same direction. If the evaluation result does not change or becomes worse, the search is stopped immediately; the parameter value is then determined according to the previous result, and the search process ends. In coordinate descent, only one parameter is determined at a time, and every iterative evaluation is a one-dimensional search, so the final result is not guaranteed to be the global optimum.
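The stopping rule described above (continue while the error improves, stop as soon as it stalls or worsens, and keep the previous value) can be sketched as a one-dimensional scan. The helper `search_backtracking_steps` and the `evaluate` callback are hypothetical; in practice `evaluate` would train the model with the other parameters fixed and return a validation error.

```python
def search_backtracking_steps(evaluate, candidates):
    """Scan candidates in order and stop at the first one that fails to improve.

    evaluate: callable mapping a number of backtracking steps to an error.
    candidates: ordered iterable of step counts to try.
    Returns the last improving step count and its error.
    """
    best_steps, best_error = None, float("inf")
    for steps in candidates:
        error = evaluate(steps)
        if error >= best_error:
            break  # unchanged or worse: keep the previous result
        best_steps, best_error = steps, error
    return best_steps, best_error
```

Because the rule stops at the first non-improving step, a small random fluctuation in the error can end the search early, which is one reason the result is only a local optimum.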
Based on the 5-minute sampling data of detector 16,395, the step-by-step greedy coordinate descent method is used to search and evaluate the number of backtracking steps of the LSTM model. The number of backtracking steps starts from 2 and gradually increases to 20. The sorted data sequence is sent to the model for training. After the training is completed, three error metrics, MAPE, MAE, and RMSE, are used to evaluate the prediction effect of the model. The results are shown in Table 1.
In Table 1, the maximum and minimum values of prediction errors are marked in bold. The maximum errors appear at the minimum number of backtracking steps (2 steps), and the minimum errors appear at 14 backtracking steps. As the backtracking steps increase, the prediction errors decrease significantly. The evaluation results in the table are visualized in Figure 4, whose x-axis is the number of backtracking steps, arranged horizontally according to the evaluation step. Through the analysis of the evaluation data, the following conclusions can be drawn. Firstly, the number of backtracking steps has a great influence on the accuracy of the model. As shown in Figure 4, as the number of backtracking steps on the x-axis increases from 2 to 20, the errors of the LSTM model measured by the three metrics gradually decrease from their maximum values, and the maximum decrease rates are 2.56%, 3.18%, and 3.32%, respectively. This also shows that time series data do have long-term relevance that can be mined through time series models. Secondly, the trends of the three errors are essentially the same at the beginning of the search, and the decline is relatively fast. The inflection point of the smooth decline appears almost simultaneously at step 14, where the errors also reach a local low. After that, the fluctuations of the errors stabilize, with occasional slight increases. This phenomenon reflects the gradual decrease in prediction error as the backtracking sequence becomes longer. However, a too-long sequence not only fails to improve the accuracy of the prediction but also has a negative impact.
To confirm the generality of the conclusion, another experiment was conducted with a different model (GRU), different data (from detector 16,885), and different sampling intervals (20 min, 30 min, and 60 min). The MAE curves of the predicted results are shown in Figure 5. Except that the lowest points differ slightly and the curves drop more smoothly, the comparative experiments obtained almost consistent results. Detector 16,885 is almost 5 miles away from the first detector and has different traffic patterns due to different travel needs, so the evaluation results are representative and convincing. The choice of the number of backtracking steps is related to the models and data. Different models or data may produce different results, but the results suggest that a reasonable range of values exists. Although differences in model or data set make the number of backtracking steps fluctuate within this range, the range can be approached step by step by searching with multiple models.
In order to verify whether the same decline trend can be obtained on other models, the 5-minute sampling data of detector 16,885 were used to search the backtracking steps of several frequently used models. The results of the backtracking step search are shown in Figure 6. The first 5 models are typical machine learning models, the 6th model is representative of deep neural network models, and the remaining models are the three variants of RNNs discussed in this paper. Each model has a different sensitivity to the correlation of time series, and their structures and characteristics are also very different. In order to clearly show the differences between all models, the MAE of each model is shown in Figure 6. Although the prediction accuracy of the models differs, the trend over backtracking steps is approximately the same and can be roughly divided into three situations. In the first, the errors quickly drop to a local low point and then gradually rise after flattening, such as (a) DT and (c) KNN. In the second, the errors decrease rapidly in the first few steps and then remain stable, such as (b) SVM, (d) RF, and (e) GBDT. In the last, the errors continue to decrease after rapidly falling to a local low point and then fluctuate slightly, such as (f) SAE, (g) LSTM, (h) GRU, and (i) BiLSTM. It can be seen from the trends of the error curves that a most suitable number of backtracking steps may exist, but the dramatic increase in computational cost makes it no longer worthwhile to continue searching.
Although the MAE curves of different models are slightly different, the overall trend is nearly the same. Of course, a longer backtracking sequence is not always better. Most models have a maximum limit; a too-long time series causes the prediction error to rise instead of fall. To select a reasonable number of backtracking steps across the various models, considering that the random fluctuations increase after the performance index flattens during the decline, a range can be determined based on the first inflection point value. According to the results shown in the figures, and considering the versatility of the model, the value range can be set between 6 and 18. Since the dependence of the time series contained in the data set also has a significant impact on the determination of the number of backtracking steps, the number of backtracking steps on different data sets may change significantly. The characteristics and distribution rules of traffic data in different data sets are quite different. For data sets with flat changes, more time dependence can be used, and the corresponding number of backtracking steps will be larger. On the contrary, for data sets with drastic changes, the correlation of the time series is greatly reduced, and the number of backtracking steps that can be used will be correspondingly smaller. In practical applications, the number of backtracking steps for each data set can be determined by experimental evaluation.
Considering the universality of the model, the fairness of the comparison results, and the time and space complexity of the calculation, all the time series models in this paper use the middle value, 12, as the number of backtracking steps. It represents exactly one hour when a 5-minute interval is used. This uniform number of backtracking steps provides a relatively fair benchmark for comparison between models.
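
The selection rule described above (stop at the first point where the error curve flattens) can be sketched as follows. This is only an illustration under stated assumptions: the function name `first_flat_step`, the relative tolerance `tol`, and the MAE values are hypothetical and are not taken from Figure 6 or from the search procedure used in this paper.

```python
from typing import Sequence


def first_flat_step(steps: Sequence[int], errors: Sequence[float], tol: float = 0.01) -> int:
    """Return the first backtracking step after which the error stops improving.

    A step counts as "flat" when moving to the next candidate lowers the
    error by less than `tol` (relative to the current error) or raises it.
    """
    if len(steps) != len(errors) or not steps:
        raise ValueError("steps and errors must be non-empty and equally long")
    for i in range(len(steps) - 1):
        improvement = (errors[i] - errors[i + 1]) / errors[i]
        if improvement < tol:
            return steps[i]
    # The error kept improving over the whole searched range.
    return steps[-1]


# Hypothetical MAE values, for illustration only.
candidate_steps = [1, 2, 3, 6, 9, 12, 15, 18]
toy_mae = [3.0, 2.4, 2.1, 1.9, 1.89, 1.88, 1.90, 1.93]
print(first_flat_step(candidate_steps, toy_mae))
```

A rule of this kind only marks where the curve starts to flatten; a single working value inside the resulting range still has to be chosen by judgment, as was done here with 12.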

Multistep Delay Prediction of Time Series.
According to the data definitions in (1) and (2), the traffic flow time series prediction model can predict a value several time steps ahead, skipping the intermediate steps. Of course, the greater the number of delay steps, the greater the randomness of the prediction, and the greater the deviation between the predicted result and the true value. However, as long as the deviation stays within an acceptable range, the delay prediction is meaningful.
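
As a minimal sketch of how such delayed samples might be assembled, the function below pairs each window of past values with the value a given number of steps after the end of the window. Definitions (1) and (2) are not reproduced in this section, so the indexing convention (a delay of 1 means the very next sample) and the names `make_delay_samples`, `backtrack_steps`, and `delay_steps` are assumptions made for illustration.

```python
import numpy as np


def make_delay_samples(series, backtrack_steps, delay_steps):
    """Build (X, y) pairs for delayed prediction.

    X[i] holds `backtrack_steps` consecutive values; y[i] is the value
    `delay_steps` steps after the last value in X[i].
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - backtrack_steps - delay_steps + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested window and delay")
    X = np.stack([series[i:i + backtrack_steps] for i in range(n_samples)])
    y = series[backtrack_steps + delay_steps - 1:]
    return X, y


# Toy series standing in for 5-minute speed readings (hypothetical values).
speeds = np.arange(20.0)
X, y = make_delay_samples(speeds, backtrack_steps=12, delay_steps=3)
print(X.shape, y.shape)
```

With a 5-minute sampling interval, 12 backtracking steps, and a delay of 3, each input row covers one hour of history and its target lies 15 minutes after the last observed value.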
Using the 5-minute samples of detector 16,885, the number of backtracking steps L_b is set to 12, and the number of delay steps D_s is gradually increased from 1 to 10. A double-layer LSTM model is established to predict the traffic speed from 5 minutes to 50 minutes after the current time point. During the evaluation, the sampling interval and the number of backtracking steps remained unchanged, while the number of delay steps gradually increased. After model training was completed, the prediction performance at each delay step was evaluated with three error metrics: MAPE, MAE, and RMSE. The results are shown in Table 2.
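
For reference, the three error metrics can be sketched with their standard textbook definitions, as below. The exact formulas used in this paper are not shown in this section, so these definitions are an assumption, and the sample speeds are made up.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)


# Hypothetical observed and predicted speeds.
observed = [50.0, 60.0, 40.0]
predicted = [45.0, 66.0, 40.0]
print(mae(observed, predicted), rmse(observed, predicted), mape(observed, predicted))
```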
The maximum and minimum error values in the table are highlighted in bold. The minimum errors appear at the first step, and the maximum errors appear at the 10th step. This shows that the prediction errors increase as the delay step increases, which is consistent with the initial expectation. As the delay step increases, the average increase of the three errors per delay step is 8.54%, 7.61%, and 8.19%, respectively. In other words, the cost of each additional delay step is an increase in the prediction errors of about 8.11%, which is basically within the acceptable range. The multistep prediction results of the LSTM model are shown in Figure 7. Judging from the indicator curves in the figure, the trends of the three error metrics are consistent and show an approximately linear growth relationship.
This result is in line with common sense: as the number of delay steps increases, a gap opens between the input time series and the predicted target, and their interdependence weakens accordingly. Therefore, it can be concluded that, within a limited number of steps, the error of the model grows approximately linearly as the number of delay steps increases.
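
The arithmetic behind the per-step figures can be sketched as follows. The three percentages are the ones quoted above; how they were derived from Table 2 is not stated here, so the step-over-step relative increase computed by `mean_step_increase` is only one plausible reading, and the error values passed to it are hypothetical.

```python
def mean_step_increase(errors):
    """Average step-over-step relative increase of an error series, in percent."""
    if len(errors) < 2:
        raise ValueError("need at least two error values")
    ratios = [(b - a) / a for a, b in zip(errors, errors[1:])]
    return 100.0 * sum(ratios) / len(ratios)


# Average per-step increases quoted in the text for MAPE, MAE, and RMSE.
reported = {"MAPE": 8.54, "MAE": 7.61, "RMSE": 8.19}
overall = sum(reported.values()) / len(reported)
print(round(overall, 2))  # 8.11

# Hypothetical error values growing by 10% per delay step.
print(mean_step_increase([2.0, 2.2, 2.42]))
```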
Another detector with a different sampling interval was used for robustness analysis. The evaluation of its prediction results shows a similar pattern. The MAE curve of this comparative experiment is shown in Figure 8.
To verify the influence of the number of delay steps on the prediction results of other models, the same process was used to predict and evaluate several frequently used models with the 5-minute data of detector 16,395. The number of backtracking steps L_b was set to 12. The number of delay steps D_s was gradually increased from 1 to 10. The MAE of the predicted results is shown in Table 3. The numbers in bold are the MAE of the best performing model in each column.
The data in the table show that the minimum MAE of every prediction model appears at the first step, where the number of delay steps is smallest. The maximum values all appear at the 10th step, which shows that the prediction errors increase as the number of delay steps increases. The average increase in MAE per delay step is about 9% to 10%.
Similarly, the MAPEs of multistep prediction are plotted in Figure 9. As the number of delay steps increases, the MAPE values differ between models, but all models show almost the same linear growth trend, which is consistent with the previous conclusions. The figure also gives a rough picture of the forecasting accuracy of the different traffic flow models. The k-nearest neighbour (KNN) model lies at the top of the graph, far from the other models, and has the worst prediction accuracy of all. In the middle are the other machine learning models: SVM, RF, GBDT, and DT. The models with the best prediction performance are at the bottom of the figure. They are a multilayer neural network model (SAE) and three RNN models (LSTM, BiLSTM, and GRU). Although the differences among these four models are very small, in most cases BiLSTM performs better than the other three. When the number of delay steps is small, BiLSTM and LSTM are comparable, but over the last 8 steps BiLSTM outperforms LSTM by virtue of having twice as many neurons. This shows that the BiLSTM model has great potential for time series prediction of traffic conditions. Overall, the three RNN variants stand out from the other models, which shows that recurrent neural networks have clear advantages in mining the causality of time series and are well suited to traffic time series prediction and other fields.

Results Analysis
In the experiment, real traffic speed data were selected to verify the prediction performance of the time series model. First, the number of backtracking steps of the LSTM model was searched with the layer-by-layer greedy coordinate descent method. With data from different detectors, the influence of the backtracking steps on the prediction accuracy of the time series was verified, revealing how the model error declines. Considering the universality of the model and the complexity of the calculations, a reasonable range of backtracking steps was determined, which provides a unified data basis for measuring prediction accuracy and for the horizontal comparison between different models. The experiment also tested and analysed the number of delay steps for time series prediction and found an approximately linear relationship between the prediction error and the number of delay steps within a limited number of time steps. As the number of delay steps increases, the time gap between the current time point and the predicted time point interrupts the dependency relationship and makes the error metrics increase to varying degrees. Even so, this kind of delay prediction is still useful in some applications and can be regarded as an effective alternative. For example, if we want to know the traffic flow after 30 minutes, we can train a prediction model with a 30-minute interval for next-step prediction, a 15-minute model for two-step delay prediction, or a 5-minute model for six-step delay prediction, and so on.
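
The 30-minute example amounts to listing the pairs of sampling interval and delay steps whose product equals the target horizon. A small sketch, with the helper name `delay_options` chosen here purely for illustration:

```python
def delay_options(horizon_min, intervals_min):
    """Return (interval, delay_steps) pairs whose product equals the horizon.

    Intervals that do not divide the horizon evenly are skipped.
    """
    return [
        (interval, horizon_min // interval)
        for interval in intervals_min
        if horizon_min % interval == 0
    ]


print(delay_options(30, [5, 15, 30]))  # [(5, 6), (15, 2), (30, 1)]
```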
There are two differences between the multistep delay model and conventional single-step prediction. The first is the accuracy of the prediction results: models with fewer delay steps predict more accurately. The second is the scale of the prediction results: the scale of a multistep prediction depends on the original scale of the model, so the sampling scale of the output is the same as that of the input data. Although there is some loss in prediction accuracy, multistep delay prediction still increases the diversity of prediction results and broadens the range of applications.

Conclusion
After discussing the importance of data sequence length in traffic state prediction, this paper builds a temporal backtracking and multistep delay model based on RNNs to learn and extract long-term and short-term dependencies. A case study was conducted to search and determine the optimal backtracking length of the traffic sequence. Three variants of RNNs and six frequently used models were subjected to multistep delay prediction to verify the impact of the delay step on the prediction accuracy. The main findings are as follows:

The prediction error decreases as the number of backtracking steps increases and then slowly rises after reaching a relatively low point. Although it is not possible to prove whether this low point is the global minimum, the number of backtracking steps corresponding to the first low point gives good model performance at a reasonable computational cost.
There is an approximately linear relationship between the prediction error of the model and the delay step within a finite number of time steps. As the number of delay steps increases, several time intervals lie between the current time point and the predicted time point, and the dependence in the sequence is interrupted, resulting in a corresponding increase in the error metrics of the model. However, this multistep delay prediction method can still play a special role in some applications, and it also increases the diversity of prediction results.
Compared with the six frequently used models, the three RNN models stand out, which shows that recurrent neural networks have marked advantages in mining time series causality and are very suitable for time series data prediction.
In the future, we will continue our research in three directions. Firstly, traffic flow patterns exhibit strong randomness and uncertainty due to external influences. Many factors affect the accuracy of predictions, such as weather, environment, and social activities. Therefore, more factors need to be considered to improve the accuracy of prediction. Secondly, more intelligent algorithms (e.g., evolutionary computing or genetic algorithms) will be tried to optimize the number of backtracking steps. Finally, the ideas of temporal backtracking and multistep delay can be extended to other fields and provide inspiration and new research directions.

Data Availability
The traffic state data used to support the findings of this study were supplied by DRIVE Net (http://www.uwdrive.net). The processed data are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.