Monte Carlo simulation-based traffic speed forecasting using historical big data
Introduction
We live among a variety of big data sources, such as GPS signals, closed-circuit television (CCTV), traffic loop detectors, climate sensors, credit card payment information, and social network services [1], [2]. Real-time traffic data in particular let us locate currently congested roads and find the shortest routes as conditions change. Over the past few decades, these data have been provided by Intelligent Transportation System (ITS) sensor devices such as Vehicle Detection Systems and Dedicated Short-Range Communications (DSRC) [3], [4], [5], [6]. Recently, as the number of vehicles has grown and congestion has spread, most people would like to know the traffic conditions of the next day, or of a given day of the week, in advance. If we can predict traffic speeds for all times and for all links connecting a start node to an end node, we can estimate the travel time of routes composed of links (e.g., road segments separated at intersections where traffic flow changes).
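The link-based travel time estimate mentioned above can be illustrated with a minimal sketch. It assumes one predicted speed per link and ignores the fact that speeds change while a vehicle traverses the route; the function name, units, and sample values are hypothetical and not taken from the paper.

```python
from typing import Sequence


def route_travel_time_minutes(lengths_km: Sequence[float],
                              speeds_kmh: Sequence[float]) -> float:
    """Sum per-link travel times (link length / predicted speed) over a route."""
    if len(lengths_km) != len(speeds_kmh):
        raise ValueError("one predicted speed is needed per link")
    total_hours = 0.0
    for length, speed in zip(lengths_km, speeds_kmh):
        if speed <= 0:
            raise ValueError("predicted speed must be positive")
        total_hours += length / speed
    return total_hours * 60.0


# Hypothetical route of three links, each with its own predicted speed.
minutes = route_travel_time_minutes([1.0, 0.5, 2.0], [30.0, 15.0, 60.0])
print(f"{minutes:.1f} min")
```

A time-dependent version would look up the predicted speed of each link at the moment the vehicle is expected to enter it, which is why predictions for all times and all links are needed.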
However, predicting the traffic flow of the next day, or of an upcoming day of the week, is more difficult than continuously predicting on-going traffic flow. For example, suppose that we would like to predict the flow within the next hour by using real-time data. Because traffic flow rarely changes abruptly over such a short interval, it is easy to predict the flow from current or one-hour-old data. On the other hand, when we would like to predict the rush hour flow of the next day or of next Monday, the prediction must rely on historical traffic information and is harder because many variables affect the flow, including congestion that could arise over the next couple of days. Namely, the prediction of traffic flow for an upcoming day of the week depends upon an irregular pattern of traffic flow.
To clarify the problem of prediction, we refer to the prediction of on-going traffic flow as continuous prediction (CP) because it uses on-going continuous traffic data within the same day. In contrast, we refer to the prediction for an upcoming day of the week as non-continuous prediction (NCP) because it relies on the historical pattern of traffic flow for that day of the week. Recently, some ITSs have begun storing historical data and attempting to predict the travel time of congested roads by using this data [7], [8]. However, it is difficult to choose an optimal set of historical data to use as input in non-continuous prediction, since the prediction accuracy for a given day's traffic can vary depending on the choice of the data. In general, the historical data contain two kinds of patterns: normal patterns and abnormal patterns. Abnormal patterns differ completely from normal ones and are caused by specific events, such as festivals or accidents, during specific time intervals. Moreover, road construction over a certain period of time can completely change even a normal traffic flow pattern afterward. This is called a big change pattern. These abnormal and big change patterns are outliers that can considerably affect the prediction accuracy.
To solve the historical data selection issue, the historical data should be grouped with other data that share a normal pattern, while outliers are removed at the same time. In this paper, we propose a three-step filtering algorithm. The first step detects large changes in the patterns and excludes the data recorded before these changes. The second step removes historical data with low correlation coefficients. The final step randomly combines the remaining data by using Monte Carlo simulation to determine the best input combination. Furthermore, we finalize the data selection by determining the optimal historical data range, using the decision factors of each method. For example, suppose that we have one hundred historical data records (2014.01.01 to 2014.04.10) for a link whose flow changes sharply on 2014.02.16 because of road construction, and for which the flow on weekends is completely different from the flow on weekdays. In the first step, we exclude fifty records (2014.01.01 to 2014.02.16) because of the big change in the pattern. To predict the flow of 2014.04.06, thirty-six weekday records are then excluded from the remainder according to the correlation analysis. Lastly, thirteen records are selected as the final input data by the simulation and decision factors.
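The three filtering steps can be sketched as follows. This is an illustrative outline under several assumptions that are not stated in the source: a single least-squares mean-shift split stands in for the changepoint analysis, correlation is measured against a hypothetical reference day (`reference`), and the forecaster inside the Monte Carlo loop is a plain average of the sampled days. The thresholds (`min_shift`, `min_corr`) and all function names are hypothetical.

```python
import numpy as np


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)


def find_big_change(daily_means, min_shift):
    """Index where a single mean shift starts, or 0 if the shift is too small."""
    best_k, best_cost = 0, np.inf
    for k in range(1, len(daily_means)):
        left, right = daily_means[:k], daily_means[k:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_k, best_cost = k, cost
    if best_k and abs(daily_means[best_k:].mean() - daily_means[:best_k].mean()) >= min_shift:
        return best_k
    return 0


def select_history(history, reference, min_shift=10.0, min_corr=0.7,
                   n_trials=500, seed=0):
    """Return (day indices, MAPE) of the best random combination of days."""
    history = np.asarray(history, dtype=float)      # shape: (days, time slots)
    reference = np.asarray(reference, dtype=float)  # shape: (time slots,)
    rng = np.random.default_rng(seed)

    # Step 1: drop every day recorded before the big change.
    start = find_big_change(history.mean(axis=1), min_shift)
    candidates = np.arange(start, len(history))

    # Step 2: drop days that correlate weakly with the reference day.
    corr = np.array([np.corrcoef(history[i], reference)[0, 1] for i in candidates])
    candidates = candidates[corr >= min_corr]
    if candidates.size == 0:
        raise ValueError("no historical day passed the filters")

    # Step 3: Monte Carlo search over random combinations of the remaining days.
    best_subset, best_error = None, np.inf
    for _ in range(n_trials):
        size = rng.integers(1, candidates.size + 1)
        subset = np.sort(rng.choice(candidates, size=size, replace=False))
        error = mape(reference, history[subset].mean(axis=0))
        if error < best_error:
            best_subset, best_error = subset, error
    return best_subset, best_error


# Synthetic data: 8 days before a big change, then alternating day types.
rng = np.random.default_rng(42)
wave = 10.0 * np.sin(2 * np.pi * np.arange(24) / 24)
weekend, weekday = 50.0 + wave, 50.0 - wave
days = [weekend + 30.0 for _ in range(8)]
days += [weekday if d % 2 == 0 else weekend for d in range(8, 20)]
history = np.array(days) + rng.normal(0.0, 0.5, size=(20, 24))
subset, error = select_history(history, reference=weekend)
print(subset, round(error, 2))
```

In this synthetic setup the first eight days are discarded by step 1 and the weekday-like days by step 2, so only weekend-like days after the change can reach the random search.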
In addition, we suggest a two-step verification to select the optimal time series forecasting method. In the first step, the Mean Absolute Percentage Error (MAPE) of each method generated by simulation is compared, and the predicted data combination with the smallest MAPE is selected. In the second step, we determine the final optimal time series forecasting method by using three measures of the difference between the predicted data and the data of the forecast day. In this paper, because historical traffic data continuously accumulate for all roads, we focus on the "volume" aspect of big data. By selecting the optimal historical dataset from the big data with distributed parallel processing, it is possible to predict traffic speed.
The main contributions of this paper can be summarized as follows.
- We propose a new statistical model for generating prediction data with high accuracy, removing input data outliers through correlation analysis and Monte Carlo simulation, applying a time-series forecasting method to each link and day, and generating the best prediction data for each time duration by using the Mean Squared Error (MSE) and Akaike Information Criterion (AIC). Finally, we verify our modeling by using the cross-validation of three measures: MAPE, R-squared, and Root MSE (RMSE). These measure prediction accuracy, that is, they measure how well the data are predicted.
- We construct a forecasting system based on big data open source tools. This system consists of Hadoop, RHive, and Hive for data processing and R as the statistical analysis package. It performs all analysis from historical data insertion to verification of the predicted data. Furthermore, because repeatedly calling MapReduce jobs to read raw data stored in Hadoop during the various analyses adds considerable load, we store the raw data in an R data file to increase the processing speed of analyses and calculations.
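The three verification measures named in the first contribution can be written down as a short sketch. It assumes the common textbook definitions, with R-squared taken as the coefficient of determination (1 minus the ratio of residual to total sum of squares); the paper may define it differently. The method names and speed values are made up for illustration.

```python
import numpy as np


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)


def rmse(actual, predicted):
    """Root Mean Squared Error, in the unit of the data (e.g., km/h)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Hypothetical observed speeds and forecasts from two candidate methods.
actual = [40.0, 50.0, 60.0, 50.0]
forecasts = {
    "method_a": [42.0, 48.0, 63.0, 50.0],
    "method_b": [30.0, 60.0, 50.0, 60.0],
}

# Pick the method with the smallest MAPE, then report all three measures.
best = min(forecasts, key=lambda name: mape(actual, forecasts[name]))
print(best,
      mape(actual, forecasts[best]),
      rmse(actual, forecasts[best]),
      r_squared(actual, forecasts[best]))
```

MAPE is scale-free, RMSE penalizes large errors more heavily, and R-squared relates the error to the variability of the observed data, so the three measures can disagree and are useful to read together.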
The remainder of this paper is organized as follows. Section 2 presents the background needed to understand traffic data and analyzes the characteristics of that data. Section 3 defines the problems and limitations of existing research on data and methods. Sections 4 and 5 describe our statistical model and system architecture in detail via an example. Section 6 presents our experiments, which use several scenarios. In Section 7, we review existing research on time series forecasting methods. Section 8 presents a summary of our approach and contributions. Finally, Section 9 concludes the paper.
Section snippets
Traffic data basics
In this section, we first introduce the standard concepts of nodes and links in ITS, and then describe the traffic data collected from ITS. Finally, we present various traffic patterns.
Problems with current time series forecasting methods
In this section, we identify three issues: big changes at macro scale, data arrangement from a data usage perspective, and the number of applied models from a model perspective.
Outline of proposed statistical model
In this section, we describe the overall process, from the selection of input data and the analysis of big traffic data to the verification of the predicted data from the perspective of data analysis and processing.
System architecture for analyzing and predicting traffic data
Given big historical traffic data with tens of thousands of links, it is necessary to build a system for analyzing them and generating short-term forecast data. To do this, we first consider the input and output of the system. Because big historical traffic data is the input, we need sufficient storage. Accordingly, we use a local file system to store the data directly imported from Busan ITS and the Hadoop Distributed File System (HDFS) [25]. Next, it is necessary to implement suitable tools
Evaluation
This section describes the hardware of the system computers, experimental evaluation scenarios, and evaluation of the results by using MAPE, RMSE, and R-squared.
Related work
In this section, we introduce general time-series forecasting methods from various domains, with particular attention to transportation. Forecast methods can be generally divided into statistical approaches and machine learning approaches, which are in turn composed of supervised and unsupervised learning [30], [31], [32].
Summary of our approach and contributions
There is a large amount of research on predicting traffic flow, and these studies almost always use sample traffic data and just one method. These methods may be able to predict fragments of services, but are not suitable for universal use because each road has a distinct spatio-temporal pattern. In addition, they overlook large changes of traffic flow caused by a large amount of construction. Because this affects the prediction accuracy, it is one of the reasons that historical data must be
Conclusion
In this paper, we determined that the historical patterns for the following few days should use a different prediction model, in which the behavior is dependent upon the spatial–temporal patterns of every road interval. For more accurate prediction, we argued that selection of the historical traffic data and suitable time series forecasting methods are necessary. To select data, we proposed a three-step filtering algorithm based on changepoint analysis, correlation analysis, and Monte Carlo
Acknowledgments
This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2015-H8501-15-1011) supervised by the IITP (Institute for Information & communications Technology Promotion).
Seungwoo Jeon received B.S. and M.S. degrees in Computer Engineering from Pusan National University (PNU), Busan, Korea, in 2006 and 2011, respectively. He is planning to develop traffic predictor software with the Big Data Processing Platform Research Center (BDRC). His research focuses on traffic prediction, statistical models, and big data processing software.
References (45)
- et al., Intelligent services for big data science, Future Gener. Comput. Syst. (2014)
- Intelligent big data processing, Future Gener. Comput. Syst. (2014)
- The business intelligence as a service in the cloud, Future Gener. Comput. Syst. (2014)
- et al., Statistical methods versus neural networks in transportation research: Differences, similarities and some insights, Transp. Res. C (2011)
- et al., Application of support vector machines in financial time series forecasting, Omega (2001)
- Financial time series forecasting using support vector machines, Neurocomputing (2003)
- et al., Modified support vector machines in financial time series forecasting, Neurocomputing (2002)
- et al., A survey of intelligent transportation systems
- S. Coleri, S.Y. Cheung, P. Varaiya, Sensor networks for monitoring traffic, in: Allerton Conference on Communication,...
- et al., A survey on intelligent transportation systems, Middle-East J. Sci. Res. (2013)
- Data-driven intelligent transportation systems: A survey, IEEE Trans. Intell. Transp. Syst.
- Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: Theoretical basis and empirical results, J. Transp. Eng.
- Development of an effective travel time prediction method using modified moving average approach
- Short-term traffic flow forecasting using fuzzy logic system methods, J. Intell. Transp. Syst.
- A Bayesian time-series model for short-term traffic flow forecasting, J. Transp. Eng.
- SVM based multi-index evaluation for bus arrival time prediction
- Big data processing for prediction of traffic time based on vertical data arrangement
- Travel-time prediction with support vector regression, IEEE Trans. Intell. Transp. Syst.
- Statistics without maths for psychology
- Straightforward Statistics for the Behavioral Sciences
Bonghee Hong received M.S. and Ph.D. degrees in Computer Engineering from Seoul National University, Seoul, Korea, in 1984 and 1988, respectively. In 1987, he joined the faculty of Computer Engineering at Pusan National University (PNU), where he is a Professor of databases in the Department of Computer Engineering. He is the director of the Big Data Processing Platform Research Center (BDRC) and a steering committee member of DASFAA. His research interests include the theory of database systems, moving object databases, spatial databases, and big data processing for traffic prediction.