Short-term traffic speed prediction under different data collection time intervals using a SARIMA-SDGM hybrid prediction model

Short-term traffic speed prediction is a key component of proactive traffic control in intelligent transportation systems. The objective of this study is to investigate short-term traffic speed prediction under different data collection time intervals. Traffic speed data were collected from an urban freeway in Edmonton, Canada. A seasonal autoregressive integrated moving average plus seasonal discrete grey model structure (SARIMA-SDGM) was proposed to perform the traffic speed prediction. The performance of the SARIMA-SDGM model was compared with that of the seasonal autoregressive integrated moving average (SARIMA) model, the seasonal discrete grey model (SDGM), the artificial neural network (ANN) model, and the support vector regression (SVR) model. The results showed that the SARIMA-SDGM model performs best, with the lowest mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE). The traffic speed prediction accuracy under different time intervals was then compared based on the SARIMA-SDGM model. The results showed that the prediction accuracy improves as the time interval increases. In addition, when the time interval is greater than 10 min, the prediction accuracy becomes stable.


Introduction
Traffic demand has grown steadily around the world over the past two decades. Transportation engineers are challenged by this ever-increasing demand and by the corresponding traffic congestion and safety issues [1][2][3][4]. Many solutions have been investigated to mitigate traffic congestion, among which the proactive traffic control system is of great importance and efficiency [5]. Specifically, short-term traffic prediction is an important component of a proactive traffic control system. Traffic parameters including traffic flow, occupancy, and traffic speed are the dominant variables in short-term traffic prediction. Although each of the three traffic parameters can be used to describe traffic congestion, both traffic flow and speed are correlated with occupancy [6]. Unlike traffic flow, each speed maps to one occupancy, whereas one traffic flow can map to two occupancies [7][8][9][10]. In addition, speed is more directly related to the traffic operation status. Besides, the real-time dynamic traffic guidance control strategy relies on short-term traffic speed prediction results. As such, short-term traffic speed prediction has been identified as a key task in developing proactive traffic control systems. The specification of time intervals for data collection is a fundamental determinant of the nature and utility of traffic condition data. In short-term traffic prediction, the data collection time interval serves as the aggregation interval of traffic speed [11]. The data collection time interval also provides the forecasting horizon for one-step-ahead forecasting. The accuracy of traffic prediction results therefore depends heavily on the data collection time interval. Nevertheless, the need for a more rigorous understanding of the effects of data collection time interval specification within the context of short-term traffic condition forecasting is not well recognized.
Instead, it has been common practice in previous research to select the data collection time interval arbitrarily, without considering its effects on the prediction results. Understanding the impact of the data collection time interval on short-term traffic prediction can provide insights into the performance of prediction results. Moreover, different applications require different data collection time intervals. For example, a predictive route guidance application requires a longer time interval, whereas traffic flow rate prediction needs a shorter time interval [12]. The data collection time interval is particularly important for traffic speed prediction. Traffic speed prediction with a large time interval has limited capacity to reflect the dynamic traffic operation status, so the prediction results cannot be applied in traffic control strategies. Conversely, if the time interval is too small, the calculation is time consuming and the traffic speed prediction results are unstable. In addition, the collection process results in missing information when the time interval is too small. As such, it is necessary to investigate the data collection time interval for short-term traffic prediction, especially for traffic speed prediction, where the speed data are discrete across time intervals.
The objective of this study is to investigate short-term traffic speed prediction under different data collection time intervals. Specifically, a seasonal autoregressive integrated moving average plus seasonal discrete grey model structure (SARIMA-SDGM) was proposed in this study. Speed data with various time intervals, collected from an urban freeway in Edmonton, Canada, were used. For model comparison purposes, four candidate methods, namely the seasonal autoregressive integrated moving average (SARIMA) model, the seasonal discrete grey model (SDGM), the artificial neural network (ANN) model, and the support vector regression (SVR) model, were estimated and compared with the SARIMA-SDGM model. Three indicators, the mean absolute error (MAE), the mean absolute percentage error (MAPE), and the root mean square error (RMSE), were used to measure the models' performance as well as the impact of the time interval on traffic speed prediction accuracy. The main contributions of the study are: (a) this paper investigates short-term traffic speed prediction under different data collection time intervals; and (b) a SARIMA-SDGM hybrid prediction model is proposed in this paper and compared to the traditional methods (i.e. SARIMA and SDGM) and machine learning methods (i.e. the ANN model and the SVR model).

Short-term traffic condition prediction methods
The past decades have seen a growth in short-term traffic condition prediction studies. Various approaches have been applied in traffic condition forecasting. Traditionally, parametric and nonparametric methods are the two main families of methods used in short-term traffic condition prediction. A method can be considered parametric when its structure is fixed and its parameters are learned from a data set [13]. Nonparametric methods, in turn, derive dynamic relationships directly from observed data and are therefore usually called data-driven approaches.
The typical parametric methods are the autoregressive integrated moving average (ARIMA) model [14][15][16] and its extended structures, such as the Kohonen-ARIMA model [17], the seasonal autoregressive integrated moving average (SARIMA) model [18], and ARIMA with Kalman filter [19]. Other commonly used parametric methods include time series models [20] and spectral analysis [21][22]. The parametric methods are easy to implement and provide explicit theoretical interpretability with a clear calculation structure. However, the parametric methods require a high-quality data set. The traffic data sequence should be accurate and stable, which conflicts with the fact that traffic data are stochastic and unstable. Therefore, it is difficult for these models to obtain accurate prediction results from actual traffic data.
Compared to the parametric methods, nonparametric methods derive the prediction results directly from data training. Owing to their learning ability and strong generalization, the nonparametric methods are able to achieve better prediction accuracy. Numerous nonparametric methods are in use, including the k-nearest neighbor approach [23][24], the multi-type neural network [25][26], the artificial neural network (ANN) model [27], kernel smoothing [28], and the support vector regression (SVR) model [29]. Nonparametric methods enable the adaptive learning of potential traffic dynamics through historical traffic data and have the desirable attribute of adapting to changing traffic conditions. However, these methods raise concerns about their black-box framework and the difficulty of model training. Besides, expanding the database needed for the adaptation decreases the computational efficiency.
Considering that each prediction method has its own application and advantage, recent studies have utilized hybrid methods that combine the merits of different methods in short-term traffic condition prediction to improve the prediction accuracy. These methods include the hybrid fuzzy rule-based approach [30], the Bayesian-neural network approach [31], and the chaos-wavelet analysis-support vector machine approach [32]. Generally, a hybrid prediction model can achieve better results than a single prediction model. Moreover, hybrid models have been verified to achieve higher prediction accuracy [33][34].

Short-term traffic speed prediction
Numerous studies have investigated short-term traffic speed prediction, which is a kind of time series prediction [35]. Linear time series models have been widely used, including the ARIMA model [14][15][36], the seasonal ARIMA (SARIMA) model [37], and the exponential smoothing model [38]. However, these linear time series models require accurate and stable traffic speed data, whereas actual traffic speed data are nonlinear and unstable. Therefore, these models cannot produce accurate forecasts for traffic speed data that have a nonlinear structure.
In recent years, with the development of machine learning technology, various machine-learning models have been adopted in traffic speed prediction. These models include support vector regression (SVR) [29], long short-term memory networks (LSTM) [39][40], and the evolving fuzzy neural network (EFNN) [33]. Wang et al. [41] proposed a bidirectional long short-term memory neural network (Bi-LSTM NN) model for traffic speed prediction. The results showed that the proposed model outperforms the ANN model. Ma et al. [42] utilized a convolutional neural network (CNN) to predict network-wide traffic speed. The results showed that the proposed method outperformed the LSTM model, with a mean squared error improvement of 42.91%. Using the traffic speed data from the Caltrans Performance Measurement System (PeMS), Liu et al. [43] predicted traffic speed with the attention convolutional neural network (ACNN) model and found that the proposed model achieved better forecast results than traditional linear models.
In addition to the time series features, traffic speed is also influenced by geographical location and spatial correlation. Thus, the prediction models which consider the spatial features were proposed. These models include vector autoregressive (VAR) model [44], statistical analysis model (SAM) [45], the grey prediction model with Fourier error correction (EFGM) [46], and the grey prediction model with Markov chain (MKGM) [47]. In these models, the prediction results were achieved by exploring the road network and capturing the correlation information of the network.
Hybrid models have also been applied in short-term traffic speed prediction. The temporal-spatial hybrid model was proposed to provide a complete description of the temporal-spatial interaction [48]. The spatial-temporal random effects (STRE) model was applied in traffic speed prediction by considering the spatial-temporal features of traffic speed [49]. The deep learning method combined with median filter preprocessing model (DLM8L) uses a convolutional neural network (CNN) to extract temporal-spatial features and forecast traffic speed on highways [50]. Intuitively, the hybrid models can achieve better prediction results than single models [33][34][37]. However, the estimation of the hybrid models is complex and requires more effort, which discourages wide-scale implementation [21].
The literature review showed that most short-term traffic speed predictions are based on time series models, spatial correlation models, and hybrid models. Compared to a single short-term traffic speed prediction model, a hybrid model is more complex to interpret but achieves more accurate results. Few studies have investigated the data collection time interval in short-term traffic prediction, although different data collection time intervals may affect the traffic speed prediction results.

Methodology
This study proposes a hybrid prediction model framework by combining the SARIMA model with SDGM model to deal with traffic speed based on temporal and spatial seasonal characteristics.

SARIMA model
SARIMA model is a commonly used time-series prediction method proposed by Box et al. [51]. As an improved form of ARIMA model, SARIMA model is used for periodic time series and performs the seasonal difference based on the ARIMA model. In addition, SARIMA model has been shown to effectively capture the seasonal feature of the time series, especially in the traffic speed time series [33,34,37,52].
Based on the ARIMA(p, d, q) model which includes autoregressive (AR) algorithm and moving average (MA) algorithm, the SARIMA (p, d, q)(P, D, Q) model can be defined in (1). In this study, the SARIMA model is used to remove the autocorrelation structure from the time series so as to generate the residual series for the statistical tests in the heteroscedasticity test.
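For reference, the standard multiplicative SARIMA form, which is consistent with the symbol definitions that follow, can be written as shown here. The seasonal period $s$ and the polynomial symbols $\phi_p$, $\Phi_P$, $\theta_q$, and $\Theta_Q$ are assumed notation and may differ from the notation of Eq (1).

```latex
\phi_p(B)\,\Phi_P(B^{s})\,(1-B)^{d}\,(1-B^{s})^{D}\,X_t
  = \theta_q(B)\,\Theta_Q(B^{s})\,\varepsilon_t
```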
where t is the time index; $\varepsilon_t$ is the residual series, that is, the random error at time t; p is the order of the short-term AR polynomial; q is the order of the short-term MA polynomial; d is the order of the short-term differencing; P is the order of the seasonal AR polynomial; Q is the order of the seasonal MA polynomial; D is the order of the seasonal differencing; and B is the backshift operator such that $BX_t = X_{t-1}$.
The SARIMA model is processed in three steps within the Box-Jenkins framework: model identification, model estimation, and model prediction [51]. In the model identification step, the periodic features of the time series are identified. The periodic features are regarded as the criteria for applying the model [33][34][53]. In the model estimation step, the model parameters are estimated using the maximum likelihood approach or the least squares approach. In the model prediction step, the forecast is obtained from the estimated model. In this study, these three steps are implemented using the SAS PROC [54]. The SARIMA algorithm in SAS is shown in Algorithm 1.
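The differencing part of the SARIMA structure can be illustrated with a minimal Python sketch. It applies the seasonal difference of order D and the short-term difference of order d to a toy series. The function names, the seasonal period `s`, and the toy data are assumptions for illustration and are not taken from the SAS implementation used in this study.

```python
def difference(series, lag=1):
    """Apply (1 - B^lag): subtract the value observed `lag` steps earlier."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def sarima_difference(series, d=0, D=0, s=1):
    """Apply (1 - B)^d (1 - B^s)^D to a list of observations."""
    out = list(series)
    for _ in range(D):          # seasonal differencing, D times
        out = difference(out, lag=s)
    for _ in range(d):          # short-term differencing, d times
        out = difference(out, lag=1)
    return out


# Toy speeds (km/h): a repeating pattern of period 4 plus a linear trend.
pattern = [80.0, 60.0, 70.0, 90.0]
speeds = [pattern[t % 4] + 0.5 * t for t in range(12)]

# One seasonal difference removes the period-4 pattern and leaves the trend step.
seasonal = sarima_difference(speeds, d=0, D=1, s=4)
# One additional short-term difference removes the remaining constant.
both = sarima_difference(speeds, d=1, D=1, s=4)
```

Each differencing pass shortens the series by its lag, so enough history must be available before the AR and MA terms are estimated.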

SDGM model
The discrete grey model (DGM) is used to predict the cross-sectional data. However, if the original sequence is a seasonal sequence, the DGM is unable to capture the oscillation of the data, leading to poor prediction accuracy [55]. Therefore, the cycle truncation accumulated generating operation (CTAGO) is introduced, as shown in Fig 1. The SDGM, an improved form of the DGM that incorporates the CTAGO operator, is proposed. Assume that $x_i^{(0)}$ is an original seasonal sequence at cross-section i. The CTAGO sequence $y_i^{(0)}$ is given by (2), where q is the periodic value and r denotes n−q+1.
where n is the total number of parameters; q is the periodic value; k is the parameter number index; and i is the cross-sectional position. The sequence $y_i^{(1)}$ can be calculated based on the first-order accumulated generating operation (1-AGO), as shown in (3).
where t is the parameter number index; and $y_i^{(1)}(k) = \left(y_i^{(1)}(1), y_i^{(1)}(2), \cdots, y_i^{(1)}(r)\right)^{T}$ is the 1-AGO sequence of the CTAGO sequence. By combining (2) and (3), the following equation is obtained.
In the above representation, Eq (3) defines the 1-AGO structure of the CTAGO sequence. Eq (4) shows the relationship between the 1-AGO sequence of the CTAGO sequence, $y_i^{(1)}$, and the original sequence $x_i^{(0)}$. As shown in (3) and (4), the sequence $y_i^{(1)}$ is an ascending sequence. Therefore, Eq (5) is used to define the sequence increment relationship structure as follows.
where $\beta_1$ and $\beta_2$ are the coefficients obtained by least-squares estimation. The coefficients $\beta_1$ and $\beta_2$ can be estimated by (6) and (7).
The relationship between $y_i^{(1)}$ and the original sequence $x_i^{(0)}$ can be calculated by (8) and (9). By combining (7) to (9), the solving process for the coefficients $\beta_1$ and $\beta_2$ can be converted into (10). The solution of the SDGM is given by (11). The time response structure of the CTAGO sequence is presented in (12). Eq (13) defines the solution of the corresponding seasonal original sequence $x_i^{(0)}$ after the inverse operation.
where $\hat{x}_i^{(0)}(t)$ is the original sequence predicted by using SDGM; $\hat{y}_i^{(0)}(t)$ is the CTAGO sequence predicted by using SDGM; and $\hat{y}_i^{(1)}(t)$ is the 1-AGO sequence of the CTAGO sequence predicted by using SDGM.
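The SDGM steps described above (CTAGO, 1-AGO, least-squares estimation of the two coefficients, and the inverse operations) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: CTAGO is assumed to be a moving sum over q consecutive values, the increment relationship is assumed to be the usual discrete grey form y1(k+1) = beta1 * y1(k) + beta2, and the forecast is generated recursively from the last accumulated value instead of through a closed-form time response. All function names are hypothetical.

```python
def ctago(x, q):
    """Cycle truncation AGO (assumed form): sum each window of q consecutive values."""
    return [sum(x[k:k + q]) for k in range(len(x) - q + 1)]


def ago(y):
    """First-order accumulated generating operation (running sum)."""
    total, out = 0.0, []
    for value in y:
        total += value
        out.append(total)
    return out


def fit_dgm(y1):
    """Least-squares fit of y1(k+1) = beta1 * y1(k) + beta2 (needs len(y1) >= 3)."""
    a, b = y1[:-1], y1[1:]
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    sxx = sum((u - mean_a) ** 2 for u in a)
    sxy = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    beta1 = sxy / sxx
    beta2 = mean_b - beta1 * mean_a
    return beta1, beta2


def sdgm_forecast(x, q, steps):
    """Forecast `steps` values beyond the seasonal sequence x with period q."""
    y0 = ctago(x, q)
    y1 = ago(y0)
    beta1, beta2 = fit_dgm(y1)
    x_hat, y0_hat, y1_last = list(x), list(y0), y1[-1]
    for _ in range(steps):
        y1_next = beta1 * y1_last + beta2   # recursion on the 1-AGO sequence
        y0_next = y1_next - y1_last         # inverse 1-AGO
        # Inverse CTAGO: y0(k+1) - y0(k) = x(k+q) - x(k)
        x_hat.append(x_hat[-q] + y0_next - y0_hat[-1])
        y0_hat.append(y0_next)
        y1_last = y1_next
    return x_hat[len(x):]


# Toy seasonal speeds (km/h) with period 4, forecast one more cycle.
history = [80.0, 60.0, 70.0, 90.0] * 3
forecast = sdgm_forecast(history, q=4, steps=4)
```

For a perfectly periodic input the moving sum is constant, so the fitted recursion is linear and the forecast repeats the seasonal pattern. This makes the toy case a convenient sanity check.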

SARIMA-SDGM hybrid model
In this study, a SARIMA-SDGM hybrid model was proposed for short-term traffic speed prediction. In practice, the SARIMA model is used to forecast the periodic time series data, and SDGM (1,1) is used to forecast the cross-sectional data that have weekly seasonal characteristics. The structure of the hybrid model is given in (14).
where t is the time index; $V_t$ is the value predicted by the hybrid model; $V_t^{\mathrm{sarima}}$ is the value predicted by SARIMA; $V_t^{\mathrm{sdgm}}$ is the value predicted by SDGM; $w_t^{\mathrm{sarima}}$ is the weight assigned to SARIMA; and $w_t^{\mathrm{sdgm}}$ is the weight assigned to SDGM. The weights in the hybrid model are determined by the performance of each single-model prediction at time t: the lower the nearness degree between the actual value and the predicted value, the smaller the weight. The weight algorithm of the hybrid prediction model is as follows.
Step 1: Estimate the predicted values with the SARIMA model and the SDGM model, as given in (15) and (16), respectively; the sequence indexed by $(k)$ in (16) is the predicted data sequence obtained using SDGM.
Step 2: Calculate the corresponding nearness degrees $r_t^{\mathrm{sarima}}$ and $r_t^{\mathrm{sdgm}}$, as given in (17).
where $r_t^{\mathrm{sarima}}$ is the nearness degree obtained using SARIMA; and $r_t^{\mathrm{sdgm}}$ is the nearness degree obtained using SDGM.
Step 3: Determine the corresponding weighted coefficients from the nearness degrees, as given in (18).
where $w_t^{\mathrm{sarima}}$ is the weight assigned to SARIMA; and $w_t^{\mathrm{sdgm}}$ is the weight assigned to SDGM.
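The three steps can be illustrated with a short Python sketch. The formulas in it are assumptions chosen only to show the mechanism: the nearness degree is taken as one minus the relative absolute error (floored at zero), the weights are the nearness degrees normalized to sum to one, and the hybrid value is the weighted sum of the two single-model predictions. The exact forms of (14), (17), and (18) may differ, and the function names are hypothetical.

```python
def nearness(actual, predicted):
    """Hypothetical nearness degree: 1 minus relative absolute error, floored at 0.

    Assumes a nonzero actual speed.
    """
    return max(0.0, 1.0 - abs(actual - predicted) / abs(actual))


def hybrid_prediction(actual, v_sarima, v_sdgm):
    """Combine two single-model predictions with nearness-based weights."""
    r_sarima = nearness(actual, v_sarima)
    r_sdgm = nearness(actual, v_sdgm)
    total = r_sarima + r_sdgm
    if total == 0.0:
        w_sarima = w_sdgm = 0.5             # fall back to equal weights
    else:
        w_sarima, w_sdgm = r_sarima / total, r_sdgm / total
    return w_sarima * v_sarima + w_sdgm * v_sdgm, w_sarima, w_sdgm


# Toy values (km/h): SDGM is closer to the measured speed, so it gets more weight.
v, w_sarima, w_sdgm = hybrid_prediction(actual=80.0, v_sarima=72.0, v_sdgm=78.0)
```

Because the weights are non-negative and sum to one, the hybrid value always lies between the two single-model predictions.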

Machine learning methods
Two machine learning methods, including ANN model and SVR model, were introduced for comparison.
ANN is a data-driven model capable of complex mapping between inputs and outputs, which enables it to approximate nonlinear functions [56]. The basic structure of the ANN model consists of multiple layers, including one input layer, one output layer, and one or more hidden layers. Each layer comprises several nodes connected to the nodes in neighboring layers. When the ANN model is applied, the inputs can be previous lagged traffic speed values, while the outputs provide future traffic speed forecasts. The input-output relation of neural network models for prediction can be represented as follows, where v(t) represents the traffic speed at time t; $\hat{v}(t+d)$ is the predicted traffic speed at time t+d; F(·) is a nonlinear function; and d is the collection time interval of the traffic speed data.
SVR is a regression analysis model based on the support vector machine (SVM) [57]. The model maps the input data into a higher dimensional feature space through a nonlinear mapping, and a linear regression problem is then obtained and solved in this feature space. The goal of the SVR model is to find a function $f(x_i)$ that has at most ε deviation from the actually obtained targets $y_i$ for all the training data. The SVR model neglects errors that are less than ε, and a loss is calculated only when the absolute value of the error between $f(x_i)$ and $y_i$ is larger than ε. The structure of the SVR model can be represented as follows, where $\ell_\varepsilon$ represents the ε-insensitive loss function; C is a constant; w can be completely described as a linear combination of the training patterns $x_i$; and b is a coefficient obtained from the optimization process.
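Two ingredients of these models can be shown in a few lines of Python: the ε-insensitive loss that SVR uses, and the construction of lagged speed inputs for a one-step-ahead learner. The function names and the choice of three lags are assumptions for illustration only.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Return 0 inside the epsilon tube, otherwise the excess absolute error."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def lagged_samples(speeds, n_lags):
    """Build (inputs, target) pairs: the previous n_lags speeds -> the next speed."""
    return [(speeds[t - n_lags:t], speeds[t]) for t in range(n_lags, len(speeds))]


# Toy speeds (km/h) at consecutive collection intervals.
samples = lagged_samples([80.0, 78.0, 75.0, 77.0, 79.0], n_lags=3)

small_error = epsilon_insensitive_loss(80.0, 79.0, epsilon=2.0)  # inside the tube
large_error = epsilon_insensitive_loss(80.0, 75.0, epsilon=2.0)  # outside the tube
```

Each pair in `samples` would serve as one training example for either the ANN or the SVR model, with the target being the speed one collection interval ahead.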

Model performance measures
The performance of the SARIMA-SDGM model was compared with that of the SARIMA model, the SDGM model, the ANN model, and the SVR model. The prediction results of the SARIMA-SDGM model under different data collection time intervals were also compared. Three indicators, the mean absolute error (MAE), the mean absolute percentage error (MAPE), and the root mean square error (RMSE), were used for the comparison. They are given as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|X_i - \hat{X}_i\right|$$

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{X_i - \hat{X}_i}{\hat{X}_i}\right| \times 100\%$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \hat{X}_i\right)^{2}}$$

where n is the total number of observations; $X_i$ is the predicted parameter value; and $\hat{X}_i$ is the original parameter value.
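A minimal Python sketch of the three indicators is given here, assuming the standard definitions with the percentage error taken relative to the original (measured) value. The function names and the toy values are illustrative.

```python
import math


def mae(original, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(original, predicted)) / len(original)


def mape(original, predicted):
    """Mean absolute percentage error, in percent (original values must be nonzero)."""
    ratios = [abs((o - p) / o) for o, p in zip(original, predicted)]
    return 100.0 * sum(ratios) / len(original)


def rmse(original, predicted):
    """Root mean square error."""
    squared = [(o - p) ** 2 for o, p in zip(original, predicted)]
    return math.sqrt(sum(squared) / len(original))


# Toy measured and predicted speeds (km/h).
measured = [80.0, 60.0, 100.0]
predicted = [76.0, 63.0, 100.0]
scores = (mae(measured, predicted), mape(measured, predicted), rmse(measured, predicted))
```

MAE and RMSE are in the unit of speed, whereas MAPE is dimensionless, which makes it the easier indicator to compare across segments with different typical speeds.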

Study location
Traffic speed data were collected from an urban freeway corridor, Whitemud Drive in Edmonton, Canada, through the vehicle detection stations (VDS, including loop detectors and traffic video cameras). The west-to-east segment between 170th Street and 122nd Street was selected in this study. The selected segment was divided into nine segments based on the detector locations. Each segment is approximately 800 m long. Fig 2 shows the selected freeway and the nine segments (http://www.openits.cn/openData1/700.jhtml).

Data collection
Traffic speed data were available from online open data [58]. Twenty-four days (5 August to 28 August 2015) of speed data were extracted from the VDS system in the open data [58]. These data were selected to test the model performance. Table 1 shows the speed data collection time and location. To compare the prediction performance under different data collection time intervals, the original speed data were aggregated into 11 data collection time intervals (1 min, 3 min, 5 min, 8 min, 10 min, 12 min, 15 min, 18 min, 20 min, 25 min, and 30 min) for each segment, as shown in Table 2.
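The aggregation into the 11 time intervals can be sketched as follows. The sketch assumes that 1-min speeds are aggregated by taking the arithmetic mean over consecutive, non-overlapping blocks and that an incomplete final block is dropped. The aggregation rule is not stated in the text, so both choices, the function name, and the toy data are assumptions.

```python
def aggregate(speeds_1min, interval_min):
    """Average consecutive 1-min speeds into blocks of `interval_min` minutes."""
    n_blocks = len(speeds_1min) // interval_min   # drop an incomplete final block
    return [
        sum(speeds_1min[b * interval_min:(b + 1) * interval_min]) / interval_min
        for b in range(n_blocks)
    ]


# The 11 data collection time intervals used in this study (minutes).
INTERVALS_MIN = [1, 3, 5, 8, 10, 12, 15, 18, 20, 25, 30]

# Toy hour of 1-min speeds (km/h) cycling through 60, 61, 62, 63, 64.
one_hour = [float(60 + (t % 5)) for t in range(60)]
series_by_interval = {m: aggregate(one_hour, m) for m in INTERVALS_MIN}
```

Longer intervals yield fewer and smoother observations, which is the trade-off examined in the comparison of prediction accuracy across time intervals.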

Model performance comparison
To investigate the performance of the proposed SARIMA-SDGM model, the prediction results of the five candidate models were compared. The speed data aggregated into 1-min intervals were used for the comparison. Fig 3 shows the measured speed and the predicted speed of the different models for the nine segments in the morning peak hours, and Fig 4 shows the same comparison for the afternoon peak hours. The figures show that the predicted speed of the SARIMA-SDGM model is closer to the field-measured speed than that of the SARIMA model, the SDGM model, the ANN model, and the SVR model. This finding indicates that the SARIMA-SDGM model can better capture the variation characteristics of the field-measured speed.
To further quantify the predictive accuracy of the models, the model performance measures are shown in Tables 3 and 4. As shown in Tables 3 and 4, the SARIMA-SDGM model performs best, with the lowest MAE, MAPE, and RMSE, indicating that accounting for the characteristics of the traffic speed sequence over time correlation and ... [40,50], which showed that the SDGM model outperformed the SARIMA model. Similar results can also be found for the afternoon peak hour traffic speed prediction results in Table 4.

Model performance under different time intervals
To ... [11]. The observed association of increased prediction accuracy with increased data collection time interval is consistent with that from other valid forecasting methods [59]. Moreover, as shown in Figs 5 and 6, the lines between the 1-min and 10-min time intervals show a sharply decreasing trend for all the segments, whereas the lines between the 10-min and 30-min time intervals show a relatively flat pattern. This finding indicates that the traffic speed prediction results can be improved significantly by increasing the time interval when the time interval is smaller than 10 min. In addition, the prediction accuracy becomes stable when the time interval is greater than 10 min. This finding can be explained by the stability of the speed data under different time intervals. The standard deviation of traffic speed is approximately 5.40 when the time interval is greater than 10 min, while it is approximately 6.00 when the time interval is smaller than 10 min. The result indicates that accurate traffic speed predictions can be generated using time intervals of 10 min and longer with the SARIMA-SDGM model structure.

Discussion and conclusion
This study investigated the impact of the data collection time interval on short-term traffic speed prediction. A SARIMA-SDGM model was proposed for predicting the traffic speed under different data collection time intervals. Speed data were collected from an urban freeway in Edmonton, Canada. Parametric models (the SARIMA model and the SDGM model) and nonparametric models (the ANN model and the SVR model) were also developed and compared with the SARIMA-SDGM model using three model performance measures. The model performance under different time intervals was compared to provide insights into the effects of the data collection time interval.
The results showed that the SARIMA-SDGM model performed best, with the lowest MAE, MAPE, and RMSE, whereas the SARIMA model showed the poorest performance among the five developed models. The results indicated that the SARIMA-SDGM model can better capture the variation characteristics of the field-measured traffic speed data. For the model performance under different data collection time intervals, the results showed that the three model performance measures decreased as the time interval increased, indicating that the prediction accuracy improves with the increase in time interval. Moreover, the SARIMA-SDGM model can yield stable prediction accuracy for traffic speed data with data collection time intervals greater than 10 min.
There are some limitations to this study. (a) This study utilized the traffic speed data from nine segments. The connection between adjacent segments may affect the traffic speed prediction performance. Future work should investigate the relationship of traffic speed between adjacent segments. (b) Uncertainty in traffic speed prediction is an inevitable problem due to the stochastic volatility of traffic speed. Uncertainty models and uncertainty quantification analysis can be applied to these speed data series for short-term prediction. (c) This study showed that the SARIMA-SDGM model can yield better prediction results, but the model still cannot be applied to real-time traffic speed prediction. Thus, an online algorithm for short-term traffic speed prediction using state-of-the-art methods such as Kalman filters is also a valuable direction for future research.