ANALYSIS AND COMPARISON OF LONG SHORT-TERM MEMORY NETWORKS SHORT-TERM TRAFFIC PREDICTION PERFORMANCE
E. Dogan

Summary. Long short-term memory networks (LSTM) produce promising results in the prediction of traffic flows. However, LSTM needs a large amount of data to produce satisfactory results. Therefore, this study investigated the effect of training set size on LSTM performance and the optimum training set size for short-term traffic flow prediction problems. To this end, the number of observations in the training set was varied between 480 and 2800, and the prediction performance of the LSTMs trained on these sets was measured. In addition, the LSTM prediction results were compared with those of nonlinear autoregressive neural networks (NAR) trained on the same training sets. Increasing the LSTM training set size improved performance up to a certain point; beyond this point, performance decreased. Three main results emerged from this study: First, choosing the optimum training set size significantly improves the prediction performance of LSTM. Second, LSTM performs short-term traffic forecasting better than NAR. Third, LSTM predictions fluctuate less than those of the NAR model following instant traffic flow changes.


INTRODUCTION
Nowadays, the number of vehicles and travel demand are increasing rapidly. This increase is responsible for delays, fuel loss and high emissions globally. For this reason, the efficiency of road capacities should be increased by directing and controlling road traffic with intelligent transport systems (ITS). However, ITS needs information about the current status of traffic variables (for example, volume, speed and travel time) and future estimates of these variables. For ITS to be more efficient, it is important that traffic parameters be accurately estimated, especially in the short term, so that ITS can make fast and accurate decisions for future traffic situations. For this reason, studies on predicting the short-term future state of traffic have become important. Researchers are working to make these predictions more accurate by developing new methods. In particular, as deep learning has proven itself in many areas, its use in short-term traffic prediction has accelerated. Therefore, there is a need for research that better demonstrates the potential of deep learning in this regard.
The first study on short-term traffic flow estimation was performed using the Box-Jenkins method [1]. Time series methods were also used to estimate traffic flow in other studies [2][3][4][5][6][7][8]. However, when artificial intelligence approaches and time series methods were compared, artificial intelligence was observed to predict short-term traffic flow better [9]. Therefore, in this study, traffic flow estimation models were developed using artificial intelligence and deep network approaches, and the size of the training sets was discussed [10].
Before deep learning approaches became common, short-term traffic flow estimation was performed with ANNs. For instance, a dynamic wavelet ANN model was used to estimate traffic flow [11]. Dynamic traffic flow modelling is another approach to determine the amount of traffic flow [12]. ANN and K-NN were used together to estimate traffic flow [13]. In another study, multiscale analysis-based intelligent ensemble modelling was used to predict airway traffic [14]. Traffic flow was modelled for the city of Istanbul using different time resolutions, and the results were accurate despite the limited data [15]. Similar studies are reported in [15][16][17][18]. Deep learning has recently gained interest in the prediction of various traffic parameters. Long short-term memory (LSTM) is a sub-branch of deep learning. Previous studies have compared the performance of LSTM with that of other methods. For example, LSTM was compared with regression models [19], and LSTM generally made better predictions, except in some cases. Researchers developed a model using LSTM to predict the short-term traffic flow in exceptional traffic conditions and also studied the characteristics of traffic data [20]. In another study, LSTM and recurrent ANN models were compared with ARIMA models [21], and the researchers reported that the artificial intelligence models worked better. In a further study, LSTM was compared with recurrent ANN and regression models, with LSTM obtaining better results [22].
LSTM and short-term traffic flow prediction have been reviewed in the literature, but so far no study has examined the effect of training set size on LSTM performance. Therefore, in this study, LSTM and nonlinear autoregressive neural networks (NAR) were trained with training sets of different sizes and the optimum size was determined for the problem. In addition, the two models were compared and their results discussed. Thus, the results of this study will help to determine the size of the training set in future studies.

This article consists of introduction, methodology and conclusion sections. The introduction discusses the subject and importance of the study and summarises the related literature. The methodology section presents the data used and the estimation of missing data. Then, the LSTM and NAR approaches are briefly explained and the parameters of the models used in the study are introduced. Thereafter, the NAR and LSTM estimates are compared by hypothesis testing and the results are discussed. Finally, the conclusion section recalls the findings of the study and makes recommendations for future studies.

Data collection and missing data
Traffic flow data were collected from the D200 highway. This highway connects the major cities of Turkey. The main road traffic is not interrupted for 20 km in either direction from the counted section, so uninterrupted flow conditions prevail in the counting section. Data collection was performed with NC-350 traffic counters [23]. Separate counting devices were placed for the left and right lanes. The devices were set to record data every 15 minutes. Counting continued for 47 days, and 4,512 traffic flow records were collected (47 days × 96 intervals per day).
During counting, data often cannot be recorded at some time intervals. These unrecorded values are called missing data, and they are usually the result of faults in the device or the limitations of the counting device. After the counting operations, approximately 1% of the total data was found to be missing. Autocorrelation reveals the degree of relationship between the points of a time series, and points with high autocorrelation are used in making future predictions. To complete the missing data, traffic data that are highly autocorrelated with the missing points were used. The results of the autocorrelation calculation are given in Fig. 1. Autocorrelation was high at lag 672. Since each count covers a 15-min interval, every point in the time series is related to the point 7 days (672 / (24 × 4)) earlier. This is a very common pattern in traffic flows. In this case, the missing data can be completed with the value 672 intervals before the missing point. Let $X \in \mathbb{Z}$ be the traffic data with missing values and let $x_t \in X$ denote the traffic flow at time $t$. Also, let $\hat{x}_t \in \mathbb{Z}$ denote a missing value in the series at time $t$. According to these definitions, the missing data are completed as in Equation 1.
$$\hat{x}_t = x_{t-672} \qquad (1)$$
After completing the missing data, the data set was standardised with Equation 2 before training the models.
$$x_{std} = \frac{x - \bar{x}}{s_x} \qquad (2)$$
where $x_{std}$ is the standardised data, $x$ is the raw data, $\bar{x}$ is the mean of the dataset and $s_x$ is the standard deviation of the dataset.
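As an illustration of the two preprocessing steps above, the following Python sketch fills each missing observation with the value recorded 672 intervals (7 days) earlier and then standardises the series. It is a minimal sketch and not the code used in the study: the function names `fill_missing` and `standardise`, the use of `None` to mark missing values and the choice of the sample standard deviation are assumptions made for the example.

```python
from statistics import mean, stdev

LAG = 672  # 7 days x 96 fifteen-minute intervals per day


def fill_missing(series, lag=LAG):
    """Replace each missing value (None) with the value `lag` steps earlier.

    Assumption: missing observations are marked with None.
    """
    filled = list(series)
    for t, value in enumerate(filled):
        # Only fill when an earlier, non-missing counterpart exists.
        if value is None and t >= lag and filled[t - lag] is not None:
            filled[t] = filled[t - lag]
    return filled


def standardise(series):
    """Return (x - mean) / standard deviation for every x in the series.

    Assumption: the sample standard deviation is used.
    """
    m = mean(series)
    s = stdev(series)
    return [(x - m) / s for x in series]


# Toy usage with a lag of 2 instead of 672, on made-up counts.
counts = [10, 20, None, 40, None]
completed = fill_missing(counts, lag=2)
scaled = standardise(completed)
```

Missing values within the first 672 intervals have no earlier counterpart and are left as `None` in this sketch; they would need separate handling before standardisation.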

Long short-term memory
Long short-term memory network (LSTM) is an advanced type of recurrent neural network (RNN) that can overcome the long-term dependency problem. RNNs have produced successful results in sequence prediction tasks. However, it is often difficult for RNNs to learn long-term patterns [24]. LSTM can capture short- or long-term dependencies with the help of units that learn when to forget and when to update information.
Let $x_t$ be the input vector, $h_t$ be the output of the LSTM unit and $C_t$ be the cell state at time $t$. In the first step, the forget gate determines how much of the information in $C_{t-1}$ will be forgotten. The forget gate is a layer that uses the sigmoid function and takes $h_{t-1}$ and $x_t$ to generate values between "0" and "1". Therefore, $f_t$ in Fig. 2 can be written as:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \qquad (3)$$
The next step is to identify new information that will be stored in the cell state. This step consists of two sub-steps: the first is the input gate, which determines what information to update, and the second determines the vector containing the candidate values. In Fig. 2, the output value of the input gate is represented by $i_t$, while the output value of the second sub-step is indicated by $\tilde{C}_t$. The $i_t$ and $\tilde{C}_t$ can be written as:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad \text{and} \quad \tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \qquad (4)$$
After these steps, the old state vector ($C_{t-1}$) is updated to reveal the new state vector ($C_t$). The update process can be written as:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \qquad (5)$$

Fig. 2. Long short-term memory network unit
The last step is to determine the hidden state ($h_t$). The output gate ($o_t$) determines which parts of the cell state will appear in the output. These two quantities can be written as:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \qquad (6)$$
$$h_t = o_t \odot \tanh(C_t) \qquad (7)$$
where $\sigma(\cdot)$ is the sigmoid function, the $W_{(f,i,c,o)}$ matrices are the network parameters, $b_{(f,i,c,o)}$ are the bias terms and $\odot$ denotes the element-wise product. With these processes and gates, LSTM can successfully overcome the exploding/vanishing gradients problem [25].
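The gate computations described above can be sketched as a single forward step of one LSTM unit. The sketch below uses plain Python lists and assumes the common formulation in which every gate acts on the concatenation of $h_{t-1}$ and $x_t$; the function names and the `params` dictionary layout are hypothetical and chosen only for this example. It does not include training.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W v + b, where W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of an LSTM unit; returns (h_t, c_t)."""
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    # Forget gate: how much of the old cell state is kept.
    f_t = [sigmoid(a) for a in affine(params["W_f"], z, params["b_f"])]
    # Input gate and candidate values: what new information is stored.
    i_t = [sigmoid(a) for a in affine(params["W_i"], z, params["b_i"])]
    c_tilde = [math.tanh(a) for a in affine(params["W_c"], z, params["b_c"])]
    # Output gate: which parts of the cell state reach the output.
    o_t = [sigmoid(a) for a in affine(params["W_o"], z, params["b_o"])]
    # Cell state update followed by the hidden state.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Toy usage: one hidden unit, one input, all weights zero (hypothetical values).
zero_params = {name: [[0.0, 0.0]] for name in ("W_f", "W_i", "W_c", "W_o")}
zero_params.update({name: [0.0] for name in ("b_f", "b_i", "b_c", "b_o")})
h_t, c_t = lstm_step([0.3], [0.1], [2.0], zero_params)
```

With all weights set to zero, every gate outputs 0.5 and the candidate vector is zero, so the new cell state is simply half of the old one. This makes the example easy to check by hand.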

Nonlinear auto-regressive neural networks
Nonlinear autoregressive neural networks (NAR) are an artificial neural network (ANN) model customised for time series. NAR predicts the future value by using the past data of the time series. Like other ANNs, NAR needs a training set. Let $X \in \mathbb{Z}$ be the traffic flow data and let $x_t \in X$ denote the traffic flow value at time $t$. In this case, the future traffic flow value will be:
$$\hat{x}_{t+1} = f(x_t, x_{t-1}, \ldots, x_{t-d})$$
where $\hat{x}_{t+1}$ is the prediction value of the NAR, $f(\cdot)$ expresses the NAR black-box function and $d$ is the delay value. The backpropagation algorithm [26] and the Levenberg-Marquardt method [28,29] were used for training.
The connections of the NAR with the hidden and output layers are shown in Fig. 3. The model uses a delay parameter to estimate the traffic flow at time t + 1. In this study, the hyperbolic tangent function was used as the activation function in the hidden layer and a linear function in the output layer. To determine the appropriate NAR architecture, the number of hidden layer neurons was tested from 5 to 35. Then, the RMSE values of the different NAR architectures were analysed and 3-10-1 was selected as the appropriate NAR architecture.
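To make the NAR structure more concrete, the sketch below builds lagged input/target pairs from a series and runs the forward pass of a 3-10-1 network with a hyperbolic tangent hidden layer and a linear output. Reading 3-10-1 as three lagged inputs, ten hidden neurons and one output is an assumption, as are the function names and the random initial weights. Training with the Levenberg-Marquardt method is not shown.

```python
import math
import random


def make_lagged_pairs(series, n_lags=3):
    """Build (inputs, target) pairs: the n_lags latest values predict the next one."""
    pairs = []
    for t in range(n_lags, len(series)):
        pairs.append((series[t - n_lags:t], series[t]))
    return pairs


def init_network(n_in=3, n_hidden=10, seed=0):
    """Create untrained weights for an n_in-n_hidden-1 network (hypothetical values)."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def nar_forward(inputs, net):
    """Forward pass: tanh hidden layer followed by a linear output neuron."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(net["W1"], net["b1"])
    ]
    return sum(w * h for w, h in zip(net["W2"], hidden)) + net["b2"]


# Toy usage on a made-up, already standardised series.
toy_series = [0.1, 0.4, 0.3, 0.8, 0.6]
toy_pairs = make_lagged_pairs(toy_series, n_lags=3)
toy_net = init_network()
toy_prediction = nar_forward(toy_pairs[0][0], toy_net)
```

Because the weights are untrained, `toy_prediction` has no predictive meaning; the example only shows how the lagged inputs flow through the two layers.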
In this section, the creation of the training and test sets is introduced first. Then, the effect of the size of the training sets on the predictions of NAR and LSTM is examined and, finally, the prediction results of the two methods are evaluated by statistical tests. The pseudo-code for the creation of training and test sets with these representations is as follows:
1. Start
2. Let n := |X|, r := |tm|, p := |ej|
3. m = 1
4. j = 1
5. ej = {xt | t > (n - (j*r + p)) ⋀ t ≤ (n - j*r)}
6. tm = {xt | t > (n - j*r) ⋀ t ≤ (n - (j - 1)*r)}
7. If j < p and m < r, then j = j + 1 and return to Step 4.
   If j = p and m < r, then m = m + 1 and return to Step 4.
   If j = p and m = r, then stop.
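The pseudo-code above does not fully specify how the windows are positioned, so the following sketch shows only one plausible reading: for each test day, counted back from the end of the series, the test set is that day's observations and the training set is the block of `train_size` observations immediately before it. The function name `split_for_test_day`, its parameters and this windowing rule are assumptions for illustration and may differ from the procedure used in the study.

```python
def split_for_test_day(series, train_size, test_size=96, day=1):
    """Return (train, test) for the `day`-th test day counted from the series end.

    Assumption: the training window ends exactly where the test day begins.
    test_size=96 corresponds to one day of 15-minute counts.
    """
    n = len(series)
    test_end = n - (day - 1) * test_size
    test_start = test_end - test_size
    train_start = test_start - train_size
    if train_start < 0:
        raise ValueError("series too short for this train_size and day")
    return series[train_start:test_start], series[test_start:test_end]


# Toy usage: 20 observations, 4 per "day", 6 training observations.
toy = list(range(20))
train, test = split_for_test_day(toy, train_size=6, test_size=4, day=2)
```

In this toy call the second-to-last block of four values becomes the test set and the six values before it become the training set, so changing `train_size` varies only how far back the training window reaches.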
The delay parameter or lag value was kept equal in the LSTM and NAR models, and this value was set to . Thus, regardless of parameter d, the effect of data set size on performance was compared.
NAR and LSTM models with a training set size of 480 were named NAR5 and LSTM5, and their test results are given as box plots in Fig. 4. In Figs. 4 and 5, outliers are shown with the (+) sign. Counting these (+) signs in Figs. 4 and 5 shows that LSTM produces ten outliers, while NAR has four. This result indicates that, in rare cases, LSTM prediction errors are larger than expected. When the median values were examined, the value of NAR5 was higher than that of LSTM5. In addition, the upper and lower whiskers show that the range of LSTM5 was smaller than that of NAR5. Simply put, the LSTM approach was able to produce better results than NAR with the smallest training set size examined. It is seen from Fig. 5 that LSTM produces lower RMSE for all training set sizes. The NAR error values oscillated as the training set size increased, but no clear decrease was observed. In contrast, Fig. 5 shows that the LSTM error values clearly tend to decrease for the same increase in training set size. Following examination of the average RMSE values of the models, the lowest error was found at the same training set size for NAR and LSTM. Based on this, the error values of the models trained with this training set size (NAR25 and LSTM25) were examined more closely. Fig. 5 shows that the maximum RMSE value of NAR25 is 17 veh, whereas the maximum prediction error of LSTM25 was about 13 veh. To observe the predictions of the models in more detail, the 17th test day was examined in Fig. 6. The results of the remaining test days are presented in Appendix 1 for the reader's review. The coefficients of determination of the two models were calculated and both models were found to produce high R² values. The calculated R² values for the remaining days can be examined in Appendices 2 and 3.
Consistent with the RMSE values examined in the previous figures, LSTM predictions produced higher R² values than NAR predictions for all test days. A remarkable pattern was seen when the models were compared on the line graph. In Fig. 6, the NAR prediction line fluctuates strongly in order to approach the actual value, whereas the LSTM prediction fluctuates less. The same examination was performed for the other test days and the same result was reached. In light of these results, it was concluded that LSTM was less affected by instant traffic flow changes than the NAR model.
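The two error measures used in this comparison can be computed as in the sketch below. It assumes the usual definitions of the root mean square error and of the coefficient of determination, R² = 1 - SS_res / SS_tot, since the exact formulas are not stated here. The function names and the sample values are made up for the example.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between observed and predicted flows."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy usage with hypothetical 15-minute flows (veh).
observed = [1, 2, 3, 4]
forecast = [1, 2, 3, 6]
day_rmse = rmse(observed, forecast)
day_r2 = r_squared(observed, forecast)
```

Computing both values per test day, as done here for one toy day, gives the per-day series that box plots and hypothesis tests can then summarise.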
Although LSTM was found to be more accurate than NAR, the statistical significance of this result was tested with a t-test. The hypotheses were established as follows:
H0: If LSTM is used instead of NAR, the mean RMSE does not change (μLSTM = μNAR).
H1: If LSTM is used instead of NAR, the mean RMSE decreases (μNAR > μLSTM).
where μNAR and μLSTM represent the mean of the estimation errors of NAR and LSTM, respectively. In other words, H1 states that LSTM has a lower mean prediction error than NAR. The confidence level of the hypothesis test was 95% (α = 0.05). The p-values were examined from the table, and p < 0.05 was found for all training set sizes except the 5-day training set. In light of these results, H0 was rejected and H1 was accepted for all training sets except the 5-day one. The statistical analysis confirmed that the LSTM model usually predicted traffic flow more accurately than the NAR model for 15-min data. In addition, no improvement in the predictive performance of the NAR model was observed as the training set size increased, whereas a clear improvement was observed for the LSTM model. However, the increase in training set size should remain within certain limits. For this study, the appropriate size was found to be 2400 observations (25 days) for both models.
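A one-sided test of H0 against H1 can be sketched as follows. The variant of the t-test is not specified in the text, so this example assumes a paired test on the daily RMSE values, which is natural when both models are evaluated on the same test days. The function name and the RMSE values are hypothetical. Only the t statistic and the degrees of freedom are computed; the result is then compared with a critical value from a t table.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(rmse_nar, rmse_lstm):
    """Return (t, degrees of freedom) for H1: mean(NAR error) > mean(LSTM error).

    Assumption: both lists hold RMSE values for the same test days, in order.
    """
    diffs = [a - b for a, b in zip(rmse_nar, rmse_lstm)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical daily RMSE values (veh) for five test days.
nar_errors = [12, 14, 13, 15, 16]
lstm_errors = [10, 11, 12, 12, 13]
t_stat, dof = paired_t_statistic(nar_errors, lstm_errors)
```

For these made-up values, t is 6.0 with 4 degrees of freedom. The one-sided critical value at α = 0.05 for 4 degrees of freedom is about 2.132, so H0 would be rejected in this toy case.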

CONCLUSION
Accurate short-term traffic forecasts will improve the decision-making capabilities of traffic control systems, so that traffic flow and traffic safety will reach better levels. In this study, training sets of different sizes were created and their effects on the predictive performance of LSTM and NAR models were examined. In terms of short-term traffic estimation, the analysis results and statistical tests showed that LSTM models make better predictions than NAR models. The conclusions of this study are as follows:
- This study showed that a larger training set does not necessarily increase performance. For this reason, the optimum training set size of new deep learning approaches should be determined.
- A larger training set size does not always mean better performance for LSTM and NAR.
- LSTM estimation performance improves as the training set approaches the optimum size. The same cannot be said for NAR.
- LSTM is less affected by instant traffic flow changes than the NAR model. Therefore, LSTM produces more stable results than NAR for short-term traffic prediction.
- Statistically, the LSTM approach performs better than NAR when the training set size is greater than 480.
- LSTM produced more outliers than NAR. Therefore, in rare cases, LSTM is likely to make high errors.
- In this study, the size of the LSTM training set was discussed in the context of traffic flow prediction. The effects of other LSTM parameters will be investigated in future studies.
For this study, tests were performed for a time interval of 15 minutes, which is commonly used in the literature. Smaller time intervals can be investigated in future studies. Another limitation of this study is the use of only one data set. Future studies will be enriched with different data sets from different regions.
ITS will be an indispensable tool in the future traffic control of cities. This will make future traffic flow forecasts much more important. Therefore, it can easily be foreseen that the studies will continue for more effective use of deep learning in road traffic prediction.