Article

Traffic Flow Prediction Based on Hybrid Deep Learning Models Considering Missing Data and Multiple Factors

1 Faculty of Maritime and Transportation, Ningbo University, Ningbo 315211, China
2 Jiangsu Province Collaborative Innovation Center for Modern Urban Traffic Technologies, Nanjing 210096, China
3 National Traffic Management Engineering and Technology Research Centre, Ningbo University Sub-Centre, Ningbo 315211, China
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(14), 11092; https://doi.org/10.3390/su151411092
Submission received: 6 May 2023 / Revised: 12 July 2023 / Accepted: 13 July 2023 / Published: 16 July 2023

Abstract

Traffic forecasting becomes challenging when data are missing, and many existing studies on traffic flow forecasting with missing data overlook the relationship between data imputation and external factors. To address this gap, this study proposes two hybrid models that incorporate multiple factors for predicting traffic flow in scenarios involving data loss. Temperature, rainfall intensity and whether it is a weekday are introduced as external factors for both data imputation and forecasting. Predictive mean matching (PMM) and K-nearest neighbors (KNN) fill missing values with the data most similar to them. In the forecasting module, a bidirectional long short-term memory (BiLSTM) network extracts bidirectional time series features, which improves forecasting accuracy. PMM and KNN were therefore combined with BiLSTM as P-BiLSTM and K-BiLSTM, respectively, to forecast traffic flow. Experiments were conducted on a traffic flow dataset from expressway S6 in Poland, considering various missing scenarios and missing rates. The experimental results show that the proposed models outperform traditional models in prediction accuracy, and that considering whether it is a working day further improves predictive performance.

1. Introduction

Traffic congestion has become a common problem in many large cities [1]. It is accompanied by negative impacts such as economic losses, increased difficulty in managing traffic and air pollution [2]. Congestion can be greatly reduced by accurately predicting traffic flows and helping travelers make informed route choices. Traffic management departments can likewise use traffic flow forecasts to guide traffic and provide comfortable routes for travelers. For the entire road network, relieving congestion requires a sound Intelligent Transport System (ITS), of which traffic flow prediction is a key component [3]. Efficient and accurate traffic flow prediction provides the data that support the operation of an ITS.
In recent decades, traffic forecasting has been studied in terms of traffic flow prediction, travel time forecasting, speed forecasting, and so on. Traffic flow prediction can be divided into short-term and long-term prediction, with short-term forecasting more widely studied because of its greater applicability. Vlahogianni et al. [4] have shown that short-term traffic prediction methods fall into two main categories, namely classical statistical methods and computational intelligence (CI) methods. The most commonly used classical statistical methods are the autoregressive integrated moving average (ARIMA) model and its refinements [5,6], but these are usually designed for small datasets and are not suited to complex, dynamic time series data. Currently, the most commonly used forecasting methods are CI methods, such as artificial neural networks (ANNs) [7], convolutional neural networks (CNNs) [8], long short-term memory (LSTM) neural networks [9] and graph convolutional networks (GCNs) [10].
Different external factors may have an impact on traffic flow. It is common for some scholars to add relevant factors within the model in the hope of improving prediction accuracy. Zhang et al. [11] used a multi-factor gated recurrent unit (GRU) for traffic flow prediction, incorporating factors such as precipitation, average wind speed, maximum temperature, minimum temperature and weather types into the model. The results proved that the multi-factor GRU model provided better prediction results. Chen et al. [12] proposed the attentive attributed recurrent graph neural network (AARGNN) which predicts short-term traffic flow considering both static and dynamic factors. Experiments on real-world datasets showed that the proposed method outperforms all baseline methods. He et al. [13] proposed the multi-graph convolutional-recursive neural network (MGC-RNN). They creatively generated five correlation diagrams with multiple external factors as model inputs to predict subway passenger flow.
Most existing studies simply added external factors to prediction models and concluded that these factors improve prediction accuracy. However, He et al. [13] showed that not all factors improve prediction accuracy. Few studies have focused on the relationship between external factors and prediction accuracy.
The majority of predictive models rely on the completeness of the dataset. However, due to unavoidable factors, the data collected by sensors may contain gaps. To reduce the impact of missing data, data imputation has been incorporated into prediction models, which can be classified as hybrid and fusion models. Khan et al. [14] used multiple imputation methods combined with neural networks to predict daily average traffic flow and hourly traffic volume, eventually finding that the combination of LSTM and mean filling provided the best prediction results. Traffic flows are significant not only temporally but also spatially, and tensors provide a simple and effective way to represent spatio-temporal traffic flows; several scholars have therefore adopted tensor-based approaches for traffic flow imputation and prediction [15,16,17]. The graph Laplacian method offers an efficient way to extract spatio-temporal information and can be combined with LSTM to achieve accurate predictions in scenarios with missing data [18]. Zhao et al. [19] proposed two mean imputation methods combined with LSTM to predict traffic flow under three missing modes. All of these studies combine imputation methods with prediction models.
On the other hand, fusion models have been proposed that perform both data imputation and traffic flow prediction. Cui et al. [20] proposed an LSTM structure with imputation units (LSTM-I) to fill in the missing values in the input data. The two-layer bidirectional LSTM-I achieved high accuracy in imputation and prediction under different missing patterns, even with 80% of the data missing. However, LSTM-I extracts only temporal features and cannot capture spatial features. To better account for spatial factors, graph convolution is widely used. Cui et al. [21] proposed the graph Markov network (GMN) and spectral graph Markov network (SGMN) with spectral graph convolution operations, which were experimentally shown to perform well in both prediction accuracy and efficiency. While GMN and SGMN cannot extract time series features, GRU excels at capturing time-varying features, and combining convolutional operations with GRU neural networks offers a distinctive advantage in spatio-temporal prediction. Zhang and Dong further enhanced this approach to enable accurate predictions in various data scenarios [22,23].
In these studies, both hybrid and fusion models ignore the impact of multiple factors on traffic flow, even though such factors undoubtedly influence it. To date, relatively few studies have considered multiple factors in both the imputation and the forecasting models. For this reason, we incorporate multiple factors into both the imputation and forecasting methods to further improve accuracy and reduce the impact of missing data.
We propose two hybrid models to explore the impact of multiple factors on prediction accuracy under different missing scenarios and missing rates. The main contributions are as follows:
  • The influence of multiple factors was considered to enhance the interpretability of feature selection;
  • Two prediction models have been proposed for the missing data scenario, and multivariate data are used to improve accuracy in the missing data scenario;
  • To be more realistic, a random missing scenario and a non-random missing scenario were set up and the impact of different missing scenarios on prediction accuracy is explored.
The rest of the paper is organized as follows. Section 2 presents the structure of the proposed models. Section 3 describes the data sources and correlation analysis. Section 4 analyzes the prediction results of different combinations of models incorporating different factors. Section 5 highlights the conclusion of this paper and the outlook for future work.

2. Methodologies

In this section, the imputation models and the prediction models are introduced first, and then the proposed hybrid models are presented in detail.

2.1. Imputation Models

2.1.1. K-Nearest Neighbor

KNN is a basic machine learning algorithm for classification and regression. Its central idea is to use the samples closest to an unknown sample for classification or prediction. Since changes in traffic flow can be influenced by external factors, the KNN algorithm can interpolate missing traffic data by selecting the flows whose external factors are most similar to those of the missing sample. The advantage of this algorithm is that it accounts for the impact of external factors on traffic flow. The KNN algorithm can be implemented in the following two steps [24].
Step 1: In this step, k samples are selected according to their distances. Too large or too small a k will increase the error; through repeated experiments and verification, we found that k = 5 gives the best imputation effect. The closer two samples are, the higher their similarity, and vice versa. The distance is measured via the Euclidean distance, as shown in Equation (1):

$d = \sqrt{\sum_{j=1}^{m} (x_j - y_j)^2}$   (1)

where $m$ represents the number of influencing factors other than traffic flow, $x_j$ is the value of the j-th factor of the missing sample and $y_j$ is the value of the j-th factor of the complete sample.
Step 2: After getting all the distances between the missing data and the complete data, select the k complete data closest to the missing data, and calculate the average of the k traffic flows to fill in the missing values.
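The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration with hypothetical names, not the authors' implementation:

```python
import numpy as np

def knn_impute(missing_factors, complete_factors, complete_flows, k=5):
    """Impute one missing flow value: compute the Euclidean distance of
    Equation (1) between the external factors of the missing sample and
    those of every complete sample, pick the k nearest complete samples,
    and average their traffic flows."""
    d = np.sqrt(((complete_factors - missing_factors) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]          # indices of the k closest samples
    return complete_flows[nearest].mean()
```

For example, with two external factors, a missing sample at (0.5, 0.5) and k = 2 is filled with the mean flow of the two complete samples whose factors are closest to it.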

2.1.2. Predictive Mean Matching

Multiple imputation (MI), an effective data imputation method, was first proposed by Rubin in 1977 [25]. PMM was proposed by Little and has been refined into one of the most classical and widely used MI algorithms [26]. PMM regresses the target variable on covariates over the complete cases, obtains candidate imputed values from the resulting regression model, and takes the mean of multiple imputed values as the final fill. The PMM algorithm proceeds as follows [27]:
Step 1: Let the sample size be $n$. The number of samples with no missing data is $n_{obs}$ and the number with missing data is $n_{mis}$. $Y_{obs}$ and $Y_{mis}$ denote the observed and missing values in $Y$, respectively. $X = (X_1, X_2, \dots, X_k)$ is a set of fully observed covariates, which includes $X_{obs}$ and $X_{mis}$, with $X_{mis}$ corresponding to the missing part of $Y$.

Step 2: Use $Y_{obs}$ and $X_{obs}$ to compute the least squares estimates $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \dots, \hat{\beta}_k)$, the errors $\varepsilon$ and the residual variance $\hat{\sigma}^2$ from the regression model:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$

Step 3: $\hat{\sigma}^2$ follows a $\chi^2$ distribution with $n_{obs} - k - 1$ degrees of freedom. Draw a random number $g$ from this $\chi^2$ distribution and obtain the random draw $\sigma^2$:

$\sigma^2 = \hat{\sigma}^2 (n_{obs} - k - 1) / g$

Step 4: Draw $\beta^*$ from a multivariate normal distribution centered at $\hat{\beta}$ with covariance matrix determined by $\sigma^2$.

Step 5: Compute the fitted and predicted values as follows:

$\hat{Y}_{obs} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \dots + \hat{\beta}_k X_k$

$\hat{Y}_{mis} = \beta_0^* + \beta_1^* X_1 + \beta_2^* X_2 + \dots + \beta_k^* X_k$

Step 6: Calculate the distance $\Delta_i$ between $\hat{Y}_{obs,i}$ and $\hat{Y}_{mis}$ as follows:

$\Delta_i = |\hat{Y}_{obs,i} - \hat{Y}_{mis}|$

where $i = 1, 2, 3, \dots, n_{obs}$.

Step 7: Select the smallest $\Delta_i$ and take the corresponding $\hat{Y}_{obs,i}$ as the imputed value.

Repeat Steps 2–7 several times and take the mean of the imputed values.
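Steps 2–7 can be sketched as follows. This is a simplified, self-contained illustration — the Step 4 covariance is taken in the standard OLS form $\sigma^2 (A^T A)^{-1}$, and all names are hypothetical — not the refined PMM of mice-style software:

```python
import numpy as np

def pmm_impute(X_obs, y_obs, X_mis, m=5, rng=None):
    """Simplified predictive mean matching: for each of m rounds, draw
    perturbed coefficients, predict the missing rows, and donate the
    observed y whose fitted value is closest; average the m donations."""
    rng = np.random.default_rng(rng)
    n_obs, k = X_obs.shape
    df = max(n_obs - k - 1, 1)
    A = np.hstack([np.ones((n_obs, 1)), X_obs])           # add intercept
    beta_hat, *_ = np.linalg.lstsq(A, y_obs, rcond=None)  # Step 2: OLS
    resid = y_obs - A @ beta_hat
    sigma2_hat = resid @ resid / df
    A_mis = np.hstack([np.ones((len(X_mis), 1)), X_mis])
    draws = []
    for _ in range(m):
        g = rng.chisquare(df)                             # Step 3
        sigma2 = sigma2_hat * df / g
        beta_star = rng.multivariate_normal(              # Step 4
            beta_hat, sigma2 * np.linalg.inv(A.T @ A))
        y_fit = A @ beta_hat                              # Step 5: fitted
        y_pred = A_mis @ beta_star                        # Step 5: predicted
        idx = np.abs(y_fit[:, None] - y_pred[None, :]).argmin(axis=0)
        draws.append(y_obs[idx])                          # Steps 6-7: donate
    return np.mean(draws, axis=0)                         # mean of m rounds
```

With noiseless linear data, every round donates the exactly matching observed value, so the imputation reproduces the underlying line.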

2.2. Prediction Models

2.2.1. Long Short-Term Memory

To solve the gradient vanishing and gradient exploding problems, Hochreiter and Schmidhuber proposed the LSTM neural network on the basis of recurrent neural networks (RNNs) [28]. As shown in Figure 1a, an LSTM with a unique chain structure is able to capture the regular characteristics of time series and thus achieve time series prediction. As the parameters of each structure are independent, the gradient vanishing and gradient exploding problems are effectively avoided. Figure 1b shows the internal structure of the LSTM in detail: it is composed of the forget gate $f_t$, input gate $i_t$, output gate $o_t$, memory cell $c_t$ and current output $h_t$. The output of the previous unit $h_{t-1}$, the cell state of the previous unit $c_{t-1}$ and the input data $x_t$ serve as the inputs of the current unit. The LSTM can be described by the following formulas [29]:

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$

$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$

$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$

$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$

$h_t = o_t \odot \tanh(c_t)$

where $W_f$, $U_f$, $W_i$, $U_i$, $W_o$, $U_o$, $W_c$ and $U_c$ are weight matrices, $b_f$, $b_i$, $b_o$ and $b_c$ are bias vectors, $\odot$ is the Hadamard product, and $\sigma$ and $\tanh$ are activation functions:

$\sigma(x) = \frac{1}{1 + e^{-x}}$

$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$

The final predicted value $\hat{y}_{t+1}$ after passing through the fully connected layer is:

$\hat{y}_{t+1} = \tanh(h_t W_y + b_y)$

where $W_y$ is a weight matrix and $b_y$ is a bias vector.
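A single LSTM step following the six gate equations above can be written directly in NumPy. A minimal sketch with illustrative parameter names (`p` holds the weight matrices $W_*$, $U_*$ and biases $b_*$):

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: gates from the current input and previous output,
    then the cell-state and hidden-state updates via Hadamard products."""
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])   # forget gate
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])   # input gate
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])   # output gate
    c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])
    c = f * c_prev + i * c_tilde        # elementwise (Hadamard) products
    h = o * np.tanh(c)
    return h, c
```

With all parameters zero, every gate evaluates to sigmoid(0) = 0.5, so the new cell state is half the previous one — a quick sanity check on the equations.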

2.2.2. Bidirectional Long Short-Term Memory

LSTM is trained in the forward direction and can only extract forward time series information; reverse information is not well captured. To address this, Graves and Schmidhuber proposed BiLSTM, which combines a reverse LSTM with a forward LSTM [30]. Because it extracts time series information in both directions, it has an advantage in prediction. The structure of the BiLSTM is shown in Figure 2: $x$ is fed into the forward and reverse LSTMs to obtain the outputs of each direction, which are combined to produce the final prediction $y$. The final predicted value is calculated as follows:

$\hat{y}_{t+1} = \tanh(\overrightarrow{h}_t W_y + \overleftarrow{h}_t U_y + b_y)$

where $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ denote the outputs of the forward and reverse LSTMs, respectively, $W_y$ and $U_y$ are weight matrices and $b_y$ is a bias vector.
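The output combination above amounts to one line once the two final hidden states are available. A sketch assuming $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ have already been computed (e.g. by running an LSTM over the sequence and over its reverse):

```python
import numpy as np

def bilstm_output(h_fwd, h_bwd, Wy, Uy, by):
    """Combine the final forward and backward hidden states into the
    prediction, per the BiLSTM output equation above."""
    return np.tanh(h_fwd @ Wy + h_bwd @ Uy + by)
```

In Keras this combination is handled internally by wrapping an LSTM layer in `Bidirectional`; the sketch only makes the output equation explicit.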

2.3. Proposed Hybrid Model

In order to solve the problem of missing data, we combine the imputation module and the prediction module to complete the traffic flow prediction in the presence of missing data. PMM, KNN and BiLSTM are combined to form P-BiLSTM and K-BiLSTM, respectively, as shown in Figure 3 with their model structures. In the models, there are two main modules, namely the imputation module and the prediction module. The imputation module completes the imputation of the data, while the prediction module predicts the traffic flow.
The input is divided into two parts: the traffic flow $y^m$ with missing data and three external factors, namely whether it is a weekday $w$, the rainfall intensity $r$ and the temperature $z$. The input at time $t$ is expressed as follows:

$x_t = (y^m_t, w_t, r_t, z_t, y^m_{t-\Delta}, w_{t-\Delta}, r_{t-\Delta}, z_{t-\Delta}, \dots, y^m_{t-n\Delta}, w_{t-n\Delta}, r_{t-n\Delta}, z_{t-n\Delta})$

where $\Delta$ denotes the time interval and the subscript $t - n\Delta$ indicates the traffic flow and factor values $n$ time intervals before $t$.
Traffic flow is normalized to the range $[0, 1]$. A binary variable indicates whether it is a weekday, with "1" representing a weekday and "0" the weekend.
The input $x_t$ with missing values is fed to KNN or PMM to obtain the complete traffic flow data $\tilde{y}$ via the imputation module; the imputed flow and the external factors are then passed together through the fully connected layer into the BiLSTM to obtain the final output $\hat{y}$.
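Assembling $x_t$ from the flow series and the three factors is a sliding-window operation. The sketch below (hypothetical names; the `(samples, timesteps, features)` layout matches what a Keras-style (Bi)LSTM layer expects) assumes the flow has already been imputed and normalized:

```python
import numpy as np

def build_windows(flow, w, r, z, n):
    """Build model inputs x_t = (y_t, w_t, r_t, z_t, ..., y_{t-n*delta}, ...)
    for each time t, per the input definition above.
    Returns an array of shape (T - n, n + 1, 4)."""
    feats = np.stack([flow, w, r, z], axis=1)        # (T, 4) feature matrix
    return np.stack([feats[t - n:t + 1]              # window of n+1 steps
                     for t in range(n, len(flow))])
```

Each sample then carries the current interval plus the previous n intervals of flow, weekday flag, rain intensity and temperature.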

3. Data Analysis

3.1. Data Sources

The traffic data used for prediction come from the permanent traffic counting station located on expressway S6 in the Tricity agglomeration area in Poland. The data cover the period from 2014 to 2017 and record traffic in one direction (southbound). The Tricity Bypass Road (expressway S6) is the eastern end segment of Polish National Road No. 6, which runs along the Baltic coast between the city of Szczecin and the Tricity Metropolitan Area, comprising the cities of Gdansk, Sopot and Gdynia. As shown in Figure 4, the red pentagram indicates the counting station and the blue line represents expressway S6.
The data are aggregated into 5 min intervals. Traffic volume for each time period, rainfall intensity, temperature and whether it was a weekend were used as data for our experiments. However, in the original dataset, the record from 2:50 to 2:55 a.m. on 2 November 2014 was missing, accounting for a mere 0.0165% of the total, and given that the gap occurred late at night, it carried no obvious features. To avoid any impact on subsequent experiments, we opted for a relatively simple hot-deck imputation: the missing traffic value was filled with data from the same moment two days before and after, and missing temperature and rainfall intensity were filled with adjacent values. After processing, the data are shown in Table 1. The data used in this study, presented in Table 1, cover the period from 27 October 2014 to 16 November 2014. "Temperature" represents the average temperature within five minutes; "Rain intensity" ranges from 0 to 100, with larger values indicating more rainfall; and "Working day" is determined by date, with "1" meaning weekday and "0" the weekend.

3.2. Correlation Analysis

Temperature, rainfall intensity and whether it is a weekday were chosen as influencing factors for traffic forecasting. In order to explore the impact of the correlation on the prediction accuracy, the Pearson correlation coefficient was adopted to describe the relationship between traffic flow and the variables. It is worth noting that the variable of whether it is a weekday is a categorical variable and the traffic flow data are continuous variables, so the Pearson correlation coefficient cannot describe the relationship between these two variables well. Therefore, we use the Pearson correlation coefficient to describe the relationship between traffic flow and temperature and the relationship between traffic flow and rainfall intensity. The Pearson formula is as follows [31]:
$r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$

where $r$ represents the Pearson correlation coefficient, $X_i$ and $Y_i$ are the $i$-th values of variables $X$ and $Y$, respectively, and $\bar{X}$ and $\bar{Y}$ denote their means.
The Pearson correlation coefficient indicates a linear relation between two indicators. It ranges between −1 and +1 and values closer to −1 and +1 imply a strong correlation. Also, a positive correlation coefficient implies that an increase in one indicator would result in an increase in another indicator, and vice versa. The relationship between the r value and the correlation strength is shown in Table 2 [31].
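The coefficient can be computed directly from the formula; a minimal NumPy sketch:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient: covariance of the centered
    variables divided by the product of their standard deviations."""
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())
```

A perfectly linear increasing relationship yields r = 1, a perfectly decreasing one r = -1, matching the interpretation in Table 2.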
The result of the Pearson correlation analysis is displayed in Figure 5. It can be seen that the correlation coefficient between traffic flow and temperature is 0.38, which is relatively low. The correlation coefficient between traffic flow and rainfall intensity is only −0.21, implying almost no correlation.
Daily traffic is extracted for autocorrelation analysis, with the result shown in Figure 6. The time range is from 27 October 2014 to 16 November 2014, of which 1, 2, 8, 9, 15 and 16 November are non-working days. Figure 6 reveals an extremely strong correlation among working days and, likewise, among non-working days. The correlation between working days and non-working days, although still strong, is lower than either of these. In addition, the traffic flow on 10 and 11 November correlates more strongly with that of non-working days. This may be attributed to the fact that 11 November was a national holiday (Independence Day), so many people likely took an extended holiday. Considering this, we designated 10 and 11 November as non-working days in the subsequent experiments.

4. Experiments

In this section, the validity of the proposed models is explored first. We then conduct experiments on different missing scenarios considering different factors, and judge the impact of each factor on prediction based on the experimental results. Finally, predictions are made at different stacking levels to test the effect of stacking depth on model performance.

4.1. Missing Data Setting

When traffic flow data are missing, the amount and distribution of the missing data can affect prediction performance. To explore this impact, both the type and the rate of missing data are set.
We set two types of missing data scenarios: random missing and non-random missing. In the random case, the missing entries do not depend on any variable and, as shown in Table 3, exhibit no pattern. In the non-random case, the missing entries depend on other variables to some extent and show regularity; in this study, we set it as continuous missing within the same time period, which, as shown in Table 4, appears as consecutive days of missing data in the same time slots.
For each type of missing data, we set three missing rates: 10%, 20% and 30%.
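The two scenarios can be simulated by masking a complete series with NaNs. A sketch under the stated settings (function names are illustrative; for 5 min data, 288 intervals per day would be the natural `period`):

```python
import numpy as np

def mask_random(flow, rate, rng=0):
    """Random missing: each dropped observation is chosen independently,
    without regard to any variable."""
    out = flow.astype(float).copy()
    gen = np.random.default_rng(rng)
    idx = gen.choice(len(out), size=round(rate * len(out)), replace=False)
    out[idx] = np.nan
    return out

def mask_nonrandom(flow, rate, period):
    """Non-random missing: the same time-of-day slots are dropped on
    consecutive days, matching the continuous-missing setting above.
    `period` is the number of intervals per day."""
    out = flow.astype(float).copy()
    slots = round(rate * period)          # missing slots per day
    for day_start in range(0, len(out) - period + 1, period):
        out[day_start:day_start + slots] = np.nan
    return out
```

The masked series is then passed to the imputation module before prediction.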

4.2. Parameter Setting

Through iterative testing, the final parameters of the models were determined as shown in Table 5.
In addition to the above parameters, the first 16 days were used as the training set and the last 5 days as the validation set, with mean squared error (MSE) as the loss function, expressed as follows [32]:
$\mathrm{MSE} = \frac{1}{T}\sum_{t=1}^{T}(\hat{y}_t - y_t)^2$
where y ^ t and y t denote the predicted values and actual values at time t , respectively; T is the total number of predicted samples.

4.3. Evaluation Metrics

To evaluate the model performance, three evaluation metrics were used, namely the mean absolute error (MAE), root mean square error (RMSE) and coefficient of determination (R2), which can be defined as follows [33]:
$\mathrm{MAE} = \frac{1}{T}\sum_{t=1}^{T}|\hat{y}_t - y_t|$

$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}(\hat{y}_t - y_t)^2}$

$R^2 = 1 - \frac{\sum_{t=1}^{T}(\hat{y}_t - y_t)^2}{\sum_{t=1}^{T}(y_t - \bar{y})^2}$, where $\bar{y} = \frac{1}{T}\sum_{t=1}^{T} y_t$
The MAE is the average of the absolute errors, regardless of their sign, and ranges from 0 to infinity. It is relatively insensitive to outliers.
Like the MAE, the RMSE takes on a range of values from 0 to positive infinity; the larger the error, the larger the value of the RMSE. However, RMSE is more affected by outliers.
R2 value closer to 1 means that the prediction is better. If the R2 value is 0, this means that each predicted value of the sample is equal to the mean, exactly the same as the mean model. If the R2 value is less than 0, it means that the constructed model is not as good as the mean model [33]. In the subsequent experimental results presentation, the percentages of R2 will be used.
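The three metrics can be computed directly from their definitions; a minimal NumPy sketch:

```python
import numpy as np

def mae(y_true, y_pred):
    # mean absolute error
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_true, y_pred):
    # root mean square error
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def r2(y_true, y_pred):
    # coefficient of determination: 1 - residual SS / total SS
    ss_res = np.sum((y_pred - y_true) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

Predicting every value as the sample mean gives R2 = 0, matching the "mean model" baseline described above.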

4.4. Prediction Results without External Factors

In this section, external factors are not taken into account. GRU, RNN and LSTM were combined with the imputation models to form corresponding hybrid models, which were compared with the proposed models to illustrate their validity. The prediction module parameters are the same as those in Table 5. All models were implemented with the TensorFlow and Keras frameworks.
Table 6 and Table 7 show the prediction results for the random and non-random missing scenarios, respectively. The results show that K-BiLSTM outperforms the other models regardless of the missing scenario and missing rate; its accuracy exceeds that of KNN combined with any other model, and P-BiLSTM shows the same pattern. The tables reveal that the improvement in prediction accuracy of K-BiLSTM and P-BiLSTM is not significant at lower missing rates, but becomes more prominent at a 30% missing rate. Another conclusion is that, at the same missing rate, the prediction error of a given model in the non-random scenario is slightly lower than in the random scenario, because the imputation module completes the data better in the non-random case and retains more of the traffic flow characteristics. In the random scenario, the PMM-based combined models perform slightly worse than the KNN-based ones, showing that the KNN module handles missing data better when external factors are excluded. Moreover, increasing the missing rate reduces prediction accuracy to some extent. In the random scenario, as the missing rate increased from 10% to 30%, the MAE and RMSE of K-BiLSTM increased by 1.94 and 2.88, respectively, and its R2 decreased by 1.2%; for P-BiLSTM, the MAE and RMSE increased by 1.75 and 3.8, respectively, and R2 decreased by 2.25%. A similar trend was observed in the non-random scenario.

4.5. Prediction Results Considering External Factors

In the previous section, the predictive performance of K-BiLSTM and P-BiLSTM was demonstrated. Here, temperature, rainfall intensity and whether it is a weekday are added to the models to test the relationship between forecast accuracy and external factors.
Table 8 and Table 9 show the prediction results under the random and non-random missing scenarios, respectively. The subscripts under the model names denote the external factors added, with z, r and w indicating temperature, rainfall intensity and whether it is a weekday, respectively; no subscript means no external factor is added. Figure 7 and Figure 8 show the prediction errors at different missing rates in the two missing scenarios.
From the experimental results, it can be seen that adding different factors to the model has a significant impact on prediction accuracy. Including temperature or rainfall intensity reduces prediction accuracy to some extent, while including whether it is a weekday improves it. In particular, in the scenario with 30% randomly missing data, the MAE and RMSE of the K-BiLSTM considering whether it was a working day decreased by up to 2.11 and 3.36, respectively, and R2 increased by 1.37%; the MAE and RMSE of the P-BiLSTM decreased by 2.12 and 3.06, respectively, and R2 rose by 2.37%. The reason is that the correlations of temperature and rainfall intensity with traffic flow are low, whereas weekends strongly influence traffic flow, so including whether it is a weekday improves accuracy. Because PMM uses linear regression for imputation, its prediction accuracy degrades more noticeably when weakly correlated factors are added. The P-BiLSTM exhibited the most significant drop in prediction performance in the scenario with 30% randomly missing data when all factors were considered: the MAE and RMSE increased by 11.33 and 11.85, respectively, while R2 decreased by 9.96%. Similarly, in the scenario with 30% non-randomly missing data, the MAE and RMSE increased by 9.83 and 14.21, respectively, while R2 decreased by 12.17%. Another reason is that the model does not extract enough sample features, resulting in larger errors. In addition, this part of the experiment further verifies that the prediction error grows as the missing rate increases. Finally, Figure 7 and Figure 8 make clear that, with the same factors at the same missing rate, K-BiLSTM outperforms P-BiLSTM regardless of the missing scenario.
This indicates that K-BiLSTM is more suitable for traffic flow prediction with missing data. Figure 9 depicts the R2 distribution of the predicted results when different variables are included: as the missing rate increases, including rainfall intensity and temperature leads to a significant decline in prediction accuracy, particularly for P-BiLSTM.

5. Conclusions

Data loss is inevitable when collecting traffic flow data, so it is necessary to simulate it; random and non-random data loss scenarios were set up in the experiments. To achieve imputation and prediction, we combined the imputation methods (KNN and PMM) with the prediction models (RNN, GRU, LSTM and BiLSTM). K-BiLSTM was experimentally demonstrated to be more accurate than the other models. In addition, in the experiments where multiple factors were added to the models, the results showed that performance improves only when whether it is a working day is included. In particular, with 30% non-randomly missing data, the MAE and RMSE of the K-BiLSTM model considering whether it was a working day decreased by up to 2.11 and 3.36, respectively, while R2 increased by 1.37%; similarly, the MAE and RMSE of the P-BiLSTM model decreased by 2.12 and 3.06, respectively, and R2 increased by 2.37%. This behavior is attributed to the correlation between traffic flow and the external factors: including factors with low correlation increases the prediction error.
In the future, the following research directions will be explored. First, traffic flow prediction can be extended from a single point to the whole urban road network for imputation and prediction. Second, a fusion model that achieves both imputation and prediction will be studied and proposed. Finally, it is important to improve the applicability of the models to achieve accurate and efficient prediction under different road conditions.

Author Contributions

Conceptualization, W.Z. and R.C.; methodology, W.Z. and K.W.; software, W.Z.; validation, J.Z., K.W. and R.C.; formal analysis, R.C.; investigation, R.C.; resources, J.Z.; data curation, J.Z.; writing—original draft preparation, K.W.; writing—review and editing, W.Z.; visualization, J.Z.; supervision, R.C.; project administration, R.C.; funding acquisition, R.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Zhejiang Province, China (Grant No. LY22G010001), the National “111” Centre on Safety and Intelligent Operation of Sea Bridge (D21013), the Healthy & Intelligent Kitchen Engineering Research Center of Zhejiang Province, and the K.C. Wong Magna Fund in Ningbo University, China.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data can be obtained from https://mostwiedzy.pl/pl/open-research-data.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. The structure of the LSTM. (a) The overall structure of the LSTM. (b) Internal structure of the LSTM unit.
Figure 2. The structure of the BiLSTM.
Figure 3. The structure of K-BiLSTM and P-BiLSTM models.
Figure 4. Location of the traffic counting station on S6 expressway in Tricity.
Figure 5. The correlation coefficients between traffic flow and variables.
Figure 6. The correlation coefficients between daily traffic flows.
Figure 7. Comparison of prediction results considering different combinations of factors under random missing scenario. (a) Comparison of prediction results under 10% random missing rate, (b) comparison of prediction results under 20% random missing rate, and (c) comparison of prediction results under 30% random missing rate.
Figure 8. Comparison of prediction results considering different combinations of factors under non-random missing scenario. (a) Comparison of prediction results under 10% non-random missing rate, (b) comparison of prediction results under 20% non-random missing rate, and (c) comparison of prediction results under 30% non-random missing rate.
Figure 9. Distribution of R2 of predicted results. (a) Under random missing scenario, and (b) under non-random missing scenario.
Table 1. Data sample table.
| Date | Time | Traffic Volume | Temperature | Rain Intensity | Working Day |
| 27 October 2014 | 0:00–0:05 | 17 | 11.1 | 0 | 1 |
| 27 October 2014 | 0:05–0:10 | 23 | 11.1 | 0 | 1 |
| 27 October 2014 | 0:10–0:15 | 16 | 11.1 | 0 | 1 |
| 27 October 2014 | 0:15–0:20 | 11 | 11.1 | 0 | 1 |
| 27 October 2014 | 0:20–0:25 | 10 | 11.1 | 0 | 1 |
Table 2. The relationship between the r value and the correlation strength.
r Value              Correlation Strength
|r| = 0              completely irrelevant
0 < |r| ≤ 0.3        basically irrelevant
0.3 < |r| ≤ 0.5      low correlation
0.5 < |r| ≤ 0.8      highly correlated
|r| = 1              completely relevant
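The correlation strengths in Table 2 are graded on the magnitude of a rank correlation coefficient such as Spearman's, which is the Pearson correlation computed on the ranks of the two series. A minimal pure-Python sketch (function names are illustrative, not the authors' code):

```python
def rank(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied positions, 1-based
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman coefficient = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

A perfectly monotone increasing pair gives +1, a perfectly monotone decreasing pair gives −1, matching the endpoints of Table 2.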
Table 3. Setting of the random missing scenario.
| Time Interval | 29 October 2014 | 30 October 2014 | 31 October 2014 | 1 November 2014 |
| 0:30–0:35 | 19 | 18 | NA | 13 |
| 0:35–0:40 | 19 | NA | 16 | 17 |
| 0:40–0:45 | 7 | 8 | 10 | NA |
| 0:50–0:55 | 9 | NA | 9 | 15 |
| 0:55–1:00 | 6 | 9 | 8 | NA |
| 1:00–1:05 | NA | 15 | NA | 11 |
| 1:05–1:10 | 12 | 10 | 10 | 11 |
Table 4. Setting of the non-random missing scenario.
| Time Interval | 29 October 2014 | 30 October 2014 | 31 October 2014 | 1 November 2014 |
| 0:30–0:35 | 19 | 18 | 18 | 13 |
| 0:35–0:40 | 19 | 13 | 16 | 17 |
| 0:40–0:45 | NA | NA | NA | 20 |
| 0:50–0:55 | NA | NA | NA | 15 |
| 0:55–1:00 | NA | NA | NA | 13 |
| 1:00–1:05 | NA | NA | NA | 11 |
| 1:05–1:10 | NA | NA | NA | 11 |
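Tables 3 and 4 contrast the two loss patterns: random loss scatters NA values across the series, while non-random loss removes a contiguous block (e.g. a sensor outage). A short pure-Python sketch of how such masks might be generated for the experiments (function names are illustrative assumptions):

```python
import random

def mask_random(values, rate, seed=0):
    """Randomly replace a fraction `rate` of entries with None (random loss)."""
    rng = random.Random(seed)
    out = list(values)
    for i in rng.sample(range(len(out)), int(rate * len(out))):
        out[i] = None
    return out

def mask_block(values, rate, start=0):
    """Replace one contiguous block with None (non-random loss)."""
    out = list(values)
    for i in range(start, min(len(out), start + int(rate * len(out)))):
        out[i] = None
    return out
```

Both functions remove the same number of observations at a given missing rate, so the two scenarios are directly comparable.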
Table 5. Detailed description of parameters.
| Parameter | Value |
| Number of neurons in each hidden layer | 24 |
| Training epochs | 50 |
| Activation function of fully connected layer | Tanh |
| Input length | 12 |
| Batch size | 32 |
| Learning rate | 0.001 |
| Optimizer | Adam |
Table 6. Prediction results under random missing scenario.
| Model | 10% | | | 20% | | | 30% | | |
| | MAE | RMSE | R2 | MAE | RMSE | R2 | MAE | RMSE | R2 |
| K-RNN | 13.29 | 18.42 | 94.63 | 13.63 | 19.21 | 92.84 | 14.66 | 20.59 | 92.92 |
| P-RNN | 13.32 | 18.23 | 94.45 | 13.83 | 19.16 | 93.85 | 15.77 | 21.87 | 91.84 |
| K-GRU | 13.21 | 18.54 | 94.66 | 13.71 | 19.05 | 93.94 | 15.32 | 21.55 | 91.95 |
| P-GRU | 13.37 | 18.86 | 94.64 | 13.79 | 19.45 | 92.89 | 15.51 | 21.91 | 91.99 |
| K-LSTM | 12.64 | 17.83 | 94.93 | 13.58 | 19.37 | 93.75 | 14.82 | 21.18 | 92.52 |
| P-LSTM | 12.66 | 17.96 | 94.81 | 13.95 | 19.37 | 93.42 | 15.32 | 21.67 | 92.16 |
| K-BiLSTM | 12.18 | 17.29 | 95.01 | 13.15 | 18.73 | 94.15 | 14.12 | 20.17 | 93.81 |
| P-BiLSTM | 12.21 | 17.46 | 94.96 | 13.65 | 19.56 | 93.61 | 15.20 | 21.26 | 92.71 |
Table 7. Prediction results under non-random missing scenario.
| Model | 10% | | | 20% | | | 30% | | |
| | MAE | RMSE | R2 | MAE | RMSE | R2 | MAE | RMSE | R2 |
| K-RNN | 12.97 | 17.83 | 94.27 | 13.43 | 19.45 | 93.49 | 14.47 | 21.95 | 91.97 |
| P-RNN | 12.38 | 17.05 | 94.64 | 12.68 | 18.46 | 94.31 | 14.98 | 21.63 | 92.69 |
| K-GRU | 12.53 | 17.67 | 95.02 | 13.98 | 20 | 93.32 | 14.84 | 21.14 | 92.35 |
| P-GRU | 12.34 | 17.55 | 94.97 | 13.74 | 19.52 | 93.27 | 14.03 | 21.02 | 92.62 |
| K-LSTM | 12.17 | 17.19 | 94.92 | 13.26 | 19.39 | 93.73 | 13.90 | 20.67 | 93.04 |
| P-LSTM | 12.31 | 16.91 | 95.22 | 13.19 | 19.24 | 93.82 | 13.95 | 20.62 | 92.91 |
| K-BiLSTM | 11.71 | 16.52 | 95.44 | 12.34 | 17.57 | 94.85 | 13.57 | 20.17 | 93.21 |
| P-BiLSTM | 11.94 | 16.8 | 95.29 | 12.51 | 17.89 | 94.66 | 13.69 | 20.40 | 93.06 |
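Tables 6 through 9 score the models with MAE, RMSE and R2 (the tables report R2 as a percentage). For reference, the three metrics can be computed from a ground-truth series `y` and predictions `yhat` as in this pure-Python sketch (the function below returns R2 as a fraction, not a percentage):

```python
def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    """Root mean squared error."""
    return (sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)) ** 0.5

def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1 - ss_res / ss_tot
```

Lower MAE and RMSE and higher R2 indicate better predictions, which is how the K-BiLSTM and P-BiLSTM rows are compared against the baselines.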
Table 8. Prediction results with different alternative combinations of external factors as input under random missing scenario.
| Model | 10% | | | 20% | | | 30% | | |
| | MAE | RMSE | R2 | MAE | RMSE | R2 | MAE | RMSE | R2 |
| K-BiLSTM | 12.18 | 17.29 | 95.01 | 13.15 | 18.73 | 94.15 | 14.12 | 20.17 | 93.81 |
| P-BiLSTM | 12.21 | 17.46 | 94.96 | 13.65 | 19.56 | 93.61 | 15.20 | 21.26 | 92.71 |
| K-BiLSTM (z) | 13.95 | 18.58 | 94.23 | 14.41 | 20.53 | 92.96 | 15.03 | 21.63 | 92.18 |
| P-BiLSTM (z) | 14.73 | 14.72 | 93.34 | 19.87 | 26.43 | 88.34 | 23.90 | 29.46 | 85.51 |
| K-BiLSTM (r) | 13.42 | 18.88 | 94.06 | 15.26 | 22.06 | 91.87 | 15.76 | 22.30 | 90.94 |
| P-BiLSTM (r) | 14.87 | 20.75 | 92.81 | 17.05 | 24.10 | 90.31 | 21.03 | 28.53 | 86.39 |
| K-BiLSTM (w) | 11.28 | 15.12 | 96.04 | 11.76 | 17.57 | 95.65 | 11.98 | 17.01 | 95.02 |
| P-BiLSTM (w) | 11.85 | 16.44 | 95.49 | 12.21 | 17.06 | 95.14 | 13.06 | 17.57 | 94.85 |
| K-BiLSTM (z, w) | 13.13 | 18.36 | 94.02 | 14.23 | 20.28 | 93.15 | 14.99 | 20.77 | 92.89 |
| P-BiLSTM (z, w) | 14.55 | 19.86 | 93.41 | 19.03 | 24.91 | 89.64 | 23.25 | 29.17 | 86.72 |
| K-BiLSTM (z, r) | 13.33 | 18.59 | 94.23 | 14.90 | 20.75 | 92.81 | 15.35 | 21.66 | 92.17 |
| P-BiLSTM (z, r) | 15.4 | 20.96 | 92.68 | 21.1 | 27 | 87.84 | 24.58 | 31.49 | 83.45 |
| K-BiLSTM (w, r) | 13.73 | 19.26 | 93.99 | 14.04 | 19.83 | 93.44 | 14.37 | 20.16 | 93.27 |
| P-BiLSTM (w, r) | 15.54 | 21.53 | 92.26 | 19.72 | 25.43 | 88.34 | 20.88 | 27.63 | 84.26 |
| K-BiLSTM (w, r, z) | 13.25 | 19.25 | 94.82 | 14.62 | 20.45 | 93.04 | 15.03 | 21.34 | 92.47 |
| P-BiLSTM (w, r, z) | 15.89 | 21.04 | 92.61 | 19.30 | 25.34 | 89.29 | 26.53 | 33.11 | 82.75 |
Table 9. Prediction results with different alternative combinations of external factors as input under non-random missing scenario.
| Model | 10% | | | 20% | | | 30% | | |
| | MAE | RMSE | R2 | MAE | RMSE | R2 | MAE | RMSE | R2 |
| K-BiLSTM | 11.71 | 16.52 | 95.44 | 12.34 | 17.57 | 94.85 | 13.57 | 20.17 | 93.21 |
| P-BiLSTM | 11.94 | 16.8 | 95.29 | 12.51 | 17.89 | 94.66 | 13.69 | 20.40 | 93.06 |
| K-BiLSTM (z) | 13.17 | 18.55 | 94.06 | 13.32 | 19.59 | 93.59 | 14.36 | 21.64 | 92.18 |
| P-BiLSTM (z) | 13.87 | 20.72 | 92.84 | 17.07 | 25.65 | 89.94 | 18.57 | 28.64 | 86.31 |
| K-BiLSTM (r) | 13.25 | 18.63 | 94.21 | 13.76 | 20.51 | 92.97 | 15.35 | 22.84 | 91.31 |
| P-BiLSTM (r) | 13.91 | 19.88 | 93.41 | 15.24 | 24.01 | 90.38 | 18.81 | 29.19 | 85.78 |
| K-BiLSTM (w) | 11.78 | 16.37 | 95.52 | 12.25 | 16.69 | 95.17 | 12.46 | 17.02 | 95.01 |
| P-BiLSTM (w) | 11.84 | 16.73 | 95.24 | 12.35 | 17.26 | 94.99 | 13.26 | 19.30 | 93.78 |
| K-BiLSTM (z, w) | 13.23 | 18.64 | 94.21 | 14.41 | 19.74 | 92.75 | 15.34 | 21.67 | 92.04 |
| P-BiLSTM (z, w) | 13.86 | 20.99 | 92.64 | 18.91 | 28.12 | 86.81 | 22.74 | 33.04 | 81.67 |
| K-BiLSTM (z, r) | 13.06 | 18.64 | 94.21 | 14.42 | 20.29 | 93.13 | 15.53 | 22.83 | 91.3 |
| P-BiLSTM (z, r) | 15.46 | 22.75 | 91.36 | 19.58 | 28.9 | 86.06 | 22.41 | 33.24 | 81.55 |
| K-BiLSTM (w, r) | 13.27 | 19.26 | 93.06 | 14.08 | 20.16 | 92.64 | 14.77 | 22.56 | 91.26 |
| P-BiLSTM (w, r) | 14.52 | 21.32 | 92.32 | 17.60 | 24.61 | 89.91 | 21.16 | 30.08 | 84.91 |
| K-BiLSTM (w, r, z) | 13.17 | 18.50 | 94.32 | 13.34 | 19.51 | 93.65 | 15.34 | 21.25 | 92.61 |
| P-BiLSTM (w, r, z) | 14.34 | 21.58 | 92.22 | 20.39 | 29.09 | 85.88 | 23.42 | 34.61 | 80.01 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zeng, W.; Wang, K.; Zhou, J.; Cheng, R. Traffic Flow Prediction Based on Hybrid Deep Learning Models Considering Missing Data and Multiple Factors. Sustainability 2023, 15, 11092. https://doi.org/10.3390/su151411092
