Context-Aware Traffic Prediction Framework Based on Series Decomposition

Forecasting traffic flow is a typical time series problem that has attracted increasing attention due to the urgent needs of intelligent transportation systems. Although numerous time series forecasting methods have been investigated in past decades, from statistics-based models to deep neural network models, the main disadvantages of existing work can be summarized as follows: 1) an inability to handle the complexity and uncertainty of series; 2) an inability to consider external features, such as spatial information and the importance of points, during the learning process; 3) unstable forecasting performance across various data patterns. In this study, a novel strategy is proposed to extract context-awareness information and integrate it with the Temporal Convolution Network (TCN) model. The resulting Context-Aware Temporal Convolution Network (CATCN) utilizes local sub-segments to portray the potential patterns of a series based on series decomposition. The experiments were conducted using three field-captured traffic datasets, and the results were compared with state-of-the-art methods. The results show that the proposed method significantly improves forecasting performance, especially on auto-correlated series.


I. INTRODUCTION
In recent years, highway traffic flow prediction has gained increasing attention, as monitoring the conditions of road networks is important in establishing intelligent transportation systems (ITS). Such prediction provides a considerable amount of information for road operators to evaluate the current traffic pattern, so that traffic congestion or severe traffic accidents might be anticipated in advance. However, traffic prediction is a challenging task due to the complex and dynamic characteristics of traffic.
To resolve this problem, statistical models were widely considered by researchers at earlier stages, including the autoregressive integrated moving average (ARIMA) [1] and support vector regression (SVR) [2], [3]. These methods were proposed under conditions of insufficient computational power and data for analysis, and therefore they encountered difficulties in capturing high-dimensional and nonlinear characteristics. Alternatively, researchers focused on deep learning models such as the long short-term memory neural network (LSTM) for traffic speed prediction [4], [5]. A hybrid comprising a fuzzy neural network (EFNN) and a Gaussian fuzzy membership function was introduced to predict the traffic speed [6]. A traffic graph convolution LSTM neural network (TGC-LSTM) was proposed, in which traffic graph convolution based on the physical network topology is combined with LSTM to improve the prediction performance [7]. Although all the aforementioned methods were investigated for traffic prediction, several issues remain: 1) Some methods require neighboring information to be incorporated in neural networks. While this procedure can enhance the prediction capability, it can also deteriorate model performance, as it requires evaluating the spatio-temporal effects of connected parts. In highway speed prediction, for example, no considerable neighboring effect is exhibited, as highway networks are generally not as complicated as city road networks. 2) Traffic speed prediction can be considered as the task of predicting a speed time series with seasonal patterns that can be extracted from prior data. In reality, however, time series are usually much more complicated, which makes capturing different patterns a challenging task. Meanwhile, the ability of deep learning models to pick up seasonality and trends from a given series is still insufficient.
The associate editor coordinating the review of this manuscript and approving it for publication was Hiu Yung Wong.
When predicting time series data, it is common to forecast later values from the seasonality of the sequential data. Therefore, the values of segments in the sequence play a significant role. In the present paper, we propose a novel context-aware temporal convolution network named CATCN to solve the traffic prediction problem. It extracts prior periodic knowledge and combines it with the original sequence. In our study, traffic flow with periodic changes exhibits autocorrelation, suggesting that its variation patterns can be readily captured. In the conducted experiments, we find that the separated data carrying information about traffic flow with explicit periodic changes yield better predictive capability. This confirms that including prior periodic knowledge is effective. The key contributions of this research include the following:
• We propose a mechanism to extract the importance features of every point under its micro-context condition, considering both the seasonality of the global series and its local neighbors. We choose several classical decomposition methods to compute the correlation between a sample and a target area in the corpora.
• The proposed model leverages both micro-context sensitivity and global longer periodic dependencies. Unlike other spatio-temporal approaches, our CATCN model does not need additional features and utilizes only the decomposition features generated from the series itself.
• The results of evaluating the proposed method on three real-world traffic datasets demonstrate that CATCN provides a better capability of capturing patterns in a series. The proposed method achieves the lowest forecasting error compared with four state-of-the-art methods.
The rest of this paper is organized as follows. Section II provides an overview of related research on time series forecasting and traffic flow prediction. In Section III, we first describe the overall architecture of the proposed framework and then introduce its building modules in detail, which include the determination of a sliding window, context-aware feature generation, and context-aware convolution. In Section IV, we discuss the results of the experiments conducted on three different datasets and compare the performance of the proposed method with that of the alternative approaches. Finally, in Section V, we draw conclusions from the experimental results and summarize the overall contribution of this research.

II. RELATED WORK

A. TIME SERIES FORECASTING
As one of the most commonly studied problems in machine learning, time series forecasting can be applied in various fields [8], [9]. In recent years, owing to its characteristics and basic use, traffic flow prediction has been considered a time series forecasting problem.
Earlier methods, such as ARIMA [10] and XGBoost [11], are widely used in time series tasks. Owing to its mathematical soundness, ARIMA can achieve acceptable performance [12]-[15] and can be combined with neural networks to further improve its performance [16], [17]. XGBoost is frequently combined with other modules so that the advantages of each module can be integrated to yield better results [18]-[20]. With the rapid evolution of deep learning frameworks, the time series forecasting problem is mostly considered from the viewpoint of neural networks, including LSTM [21], [22] and WaveNet [23], which was initially designed for audio generation [24]. Among other methods, TCN is deemed applicable to specific issues and simply relies on dilated convolution to capture longer temporal information with a growing receptive field. However, TCN ignores the local periodic characteristics of convolution features [25]. Other hybrid approaches, e.g., Multi-Stage TCN (MS-TCN), Ensemble Empirical Mode Decomposition-Temporal Convolutional Network (EEMD-TCN) and Temporal Graph Convolutional Network (T-GCN), integrate external information to help improve the forecasting capacity [26], [27].

B. EXISTING MODELS FOR TRAFFIC PREDICTION
A non-convex low-rank plus sparse decomposition model attempts to separate the rearranged matrix into low-rank and sparse matrices. The resulting non-convex optimization problem can be efficiently handled using the augmented Lagrange multiplier (ALM) algorithm [28]. Meanwhile, several existing methods apply convolutional neural networks (CNN) to traffic prediction, owing to the recent advancement of CNN-related networks and their excellent performance. Wu et al. proposed a model defined as a mixture of CNN and LSTM [29]. The model relies on the powerful feature extraction ability of CNN and accounts for the temporal characteristics of the traffic prediction problem through LSTM. The fusion convolutional LSTM network (FCL-Net) was proposed to integrate spatial and temporal dependencies [30]. A model called ITRCN converts a traffic network into images and applies a CNN to extract the underlying characteristics. Moreover, it processes temporal features by using the gated recurrent unit (GRU) [31]. The methods based on CNN are deemed more capable of capturing spatial dependencies. However, CNN may fail to consider the locality of sub-segments at the same time, so that local periodicity might be ignored.
A graph convolution network (GCN) can be used to model the dependence between connected neighbors in non-Euclidean space and to aggregate the spatial information of related nodes [32]. Although GCN allows incorporating the impact of a neighborhood into prediction, its dependence on a pre-defined graph structure makes it unstable in dynamic scenarios. Zhao et al. [33] proposed a model combining GCN and GRU, called T-GCN, that captures both temporal and spatial relations. Many researchers have already found that the structure of a road and its connected nodes, such as interchanges or toll stations, provides relevant information. Zheng et al. [34] proposed the graph multi-attention network (GMAN), which utilizes multiple attention blocks to model the impacts of spatio-temporal factors on prediction performance.

III. FRAMEWORK
To address the traffic prediction problem, we propose a novel framework that fuses local structure contexts and global trends to enable the model to better capture series patterns. The overall architecture is illustrated in Figure 1. Sensitivity fusion captures periodic local dependencies and combines prior knowledge with the original series. Causal convolution ensures the consistency of the channels during the fusion process. The sensitivity decomposition module extracts the trend and seasonal components of the original series, which preserve the global features. Meanwhile, the method generates VSD, VMD and DPR vectors, which preserve the local features. These features are then integrated by another causal convolution. Receptive fields grow exponentially as the layers deepen, enabling dilated convolution to extract both global and local patterns. After propagation through a dense layer, the network forecasts a value ŷ. It should be noted that the thickness of the connections in the last fully connected layer varies, reflecting that the significance of each point is fused and has different weights.
Specifically, we apply causal convolution to transform the channel dimension and to establish interaction between channels. The segment levels of the original series are acquired through the sampling process. Therefore, by estimating local contexts, we can extract context-aware vectors with the same length as the original series. The capability of capturing longer periodic context-awareness information in the neural network is ensured by stacked dilated convolutions with exponentially growing receptive fields. The network outputs the forecasting results one point at a time after propagation through dense layers.

A. LOCAL CONTEXTS GENERATION
As the length of each context can strongly influence the capability of the proposed model to capture series patterns, we need to determine the length of each context first. We define a sliding window W whose size |W| gives the length of every single context throughout the sampling process. Consider a series $S = \{x_1, x_2, \ldots, x_n\}$, in which $n$ denotes the length of the series and $x_{n+1}$ denotes the value at its next timestamp. We first determine the upper limit of the sliding window size |W| using the moving average. The moving average plays the role of a low-pass filter that eliminates high-frequency disturbance in a time series and maintains the useful low-frequency trend. Low-frequency filtering at time t amounts to convolving the time series S with a window of length |W|. The filtering function F in this window is the average of the |W| points of S that fall inside the window at time t, in which |W| denotes the size of the sliding window and $x_t$ denotes the point at timestamp t.
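The low-pass behaviour of the moving average can be illustrated with a minimal Python sketch. The window is assumed here to be trailing (it covers the current point and the points before it), and positions with fewer than `window` earlier points average over what is available; the text above does not fix either choice, and the function name is illustrative only.

```python
def moving_average(series, window):
    """Trailing moving average (assumed alignment): mean of up to `window`
    most recent points ending at each position."""
    if window < 1:
        raise ValueError("window must be >= 1")
    smoothed = []
    running_sum = 0.0
    for t, value in enumerate(series):
        running_sum += value
        if t >= window:
            # Drop the point that just left the window.
            running_sum -= series[t - window]
        count = min(t + 1, window)
        smoothed.append(running_sum / count)
    return smoothed


# A series alternating between 10 and 12 is flattened to its mean level.
noisy = [10, 12, 10, 12, 10, 12, 10, 12]
print(moving_average(noisy, 2))
```

With a window of 2, the alternating disturbance is removed after the first point, which is the sense in which the filter keeps the low-frequency trend.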
For each point $x_i \in S$, we compute the moving average with the window size varying from 1 to n. The upper limit of the sliding window size $W^{up}_i$ at the i-th position is reached when the mean absolute percentage error reaches its minimum. Therefore, we obtain a series of upper limits of |W|, $\{W^{up}_1, W^{up}_2, \ldots, W^{up}_n\}$. We perform a grid search over |W| with upper limit $W^{up}_i$ for each $x_i \in S$ by computing the autocorrelation coefficient, where $\bar{x}$ denotes the mean value of the series and k represents the range of the autocorrelation computation. The optimal sliding window size at the i-th position, denoted $w_i$, is selected from this grid search. Finally, the sliding window size |W| is determined by voting: the candidate value that receives the most votes among $w_1, \ldots, w_n$ is taken as the window size. Suppose that $x_{n+1}$ is the point to predict for the given series S; the target context is then sampled from the tail of the series, ending at $x_n$. We then sample each point to extract its local context while sliding the window through the entire series. In each sample context, we exclude the sample point itself, meaning that we only consider its neighboring points as its context. It should be noted that if the length of a sample context is less than |W|, zero-padding is applied to keep the length fixed. After sampling, we obtain one target context and n sample contexts. In vectorized representation, the target context is represented as a target vector $V_t \in \mathbb{R}^{1 \times |W|}$, and all sample contexts are represented as a vector $V_s \in \mathbb{R}^{n \times |W|}$.

202850 VOLUME 8, 2020

Algorithm 1: Sensitivity Fusion
Input: The series S with channel dimension $n_c$ and length n.
Output: The series fused with local context.
3 Apply zero-padding to series S, padding |W|/2 zeros to the head and the tail of the series respectively;
4 Sample the tail of series S to obtain the target context {..., $x_n$};
5 Duplicate the target context n times to obtain the target vector $V_t$;
6 for i = 1:n do
7   Sample series S by the sliding window to obtain local contexts;
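The window selection and context sampling described above can be sketched in Python as follows. Several details are assumptions made for illustration rather than taken from the text: the autocorrelation uses the standard sample estimator, the window size is even so that |W|/2 neighbors are taken on each side of a point, and the target context is the last |W| points of the series. All function names are hypothetical.

```python
from collections import Counter


def autocorrelation(series, lag):
    """Standard sample autocorrelation of `series` at lag `lag` (assumed form)."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0 or lag >= n:
        return 0.0
    numer = sum((series[t] - mean) * (series[t + lag] - mean)
                for t in range(n - lag))
    return numer / denom


def vote_window(per_point_windows):
    """Pick the window size that occurs most often among the per-point w_i."""
    return Counter(per_point_windows).most_common(1)[0][0]


def sample_contexts(series, window):
    """Return (target_context, local_contexts) for an even `window`.

    Each local context holds the window // 2 neighbors on either side of
    x_i and excludes x_i itself; zero-padding at the head and the tail
    keeps every context at length `window`.
    """
    if window % 2 or not 0 < window <= len(series):
        raise ValueError("window must be even and no longer than the series")
    half = window // 2
    padded = [0.0] * half + list(series) + [0.0] * half
    contexts = []
    for i in range(len(series)):
        centre = i + half
        left = padded[centre - half:centre]
        right = padded[centre + 1:centre + 1 + half]
        contexts.append(left + right)
    target = list(series[-window:])
    return target, contexts


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
target, contexts = sample_contexts(series, 2)
print(target)        # [5.0, 6.0]
print(contexts[0])   # [0.0, 2.0]
print(contexts[-1])  # [5.0, 0.0]
```

Duplicating `target` n times gives the target vector, and `contexts` plays the role of the n × |W| matrix of sample contexts.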

B. CONTEXT-AWARE VECTOR GENERATION
After context generation, we obtain |W| features for each context, and the sub-series data corresponding to each time series are mapped to points in the |W|-dimensional space. In this way, the historical locality of a time series is preserved while taking into account the dimension of the series and the complexity of computations.
For each sample context, we apply three different methods to compute the similarity between itself and the target context.

1) VALUE SQUARE DEVIATION (VSD)
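The VSD between the i-th local context and the target context is computed as
$$\mathrm{VSD}_i = \frac{1}{|W|} \sum_{j=1}^{|W|} \left(S_{ij} - T_{tj}\right)^2,$$
where |W| denotes the sliding window size; $S_{ij}$ is the j-th value of local context $S_i$; $T_{tj}$ corresponds to the j-th value of target context $T_t$. VSD measures the average square deviation between two contexts.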

2) VALUE MEAN DEVIATION (VMD)
VMD measures the average mean deviation between two contexts.

3) DOT PRODUCT RATIO (DPR)
DPR is used to measure the ratio of the dot product between two contexts, and its value range is $\left[\frac{n-2}{2n-1}, 1\right]$. Then we decompose the series using STL, a filtering procedure for decomposing a time series into trend, seasonal and remaining components based on loess [35]. The STL decomposition comprises two recursive procedures: an inner loop and an outer loop. In detail, the inner loop consists of six steps: detrending, cycle-subseries smoothing, low-pass filtering of the smoothed cycle-subseries, detrending of the smoothed cycle-subseries, deseasonalizing, and trend smoothing. The decomposed trend and seasonality of the time series are therefore representative features that reflect the overall characteristics of the series.
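The deviation-based measures above can be sketched in Python. VSD follows its description as an average square deviation; VMD is read here as the mean absolute deviation between the two contexts, which is an assumption. DPR is left out because its formula cannot be recovered from the description alone.

```python
def vsd(context, target):
    """Value square deviation: mean of squared element-wise differences."""
    if len(context) != len(target):
        raise ValueError("contexts must have the same length")
    return sum((s - t) ** 2 for s, t in zip(context, target)) / len(target)


def vmd(context, target):
    """Value mean deviation, assumed to be the mean absolute difference."""
    if len(context) != len(target):
        raise ValueError("contexts must have the same length")
    return sum(abs(s - t) for s, t in zip(context, target)) / len(target)


def feature_series(contexts, target, measure):
    """Apply one measure to every local context, giving a length-n series."""
    return [measure(context, target) for context in contexts]


target = [1.0, 2.0, 5.0]
contexts = [[1.0, 2.0, 3.0], [1.0, 2.0, 5.0]]
print(feature_series(contexts, target, vsd))
print(feature_series(contexts, target, vmd))
```

Because each measure yields one value per local context, the resulting feature series has the same length as the original series and can later be attached to it as an extra channel.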
A time series can be regarded as a superposition of different components, $Y = T_r + S_e + R_e$, where Y denotes the original series, $T_r$ the trend component, $S_e$ the seasonal component, and $R_e$ the remaining component. To fuse the context-aware vectors generated by the methods mentioned above, we concatenate them with the original series in the channel direction:
$$S \oplus S_{VSD} \oplus S_{VMD} \oplus S_{DPR} \oplus S_e \oplus T_r,$$
where ⊕ denotes concatenation; $S_{VSD}$ corresponds to the VSD series; $S_{VMD}$ is the VMD series; $S_{DPR}$ denotes the DPR series; $S_e$ refers to the seasonal component of the series; $T_r$ represents the trend component of the series. After channel extension, the original time series incorporates the local periodic information as its prior knowledge.
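The additive decomposition and the channel extension can be illustrated with a small Python sketch. It uses a classical moving-average decomposition as a simple stand-in for STL, so it is not the loess-based procedure described above; the function names and the choice of period are assumptions for illustration.

```python
def decompose_additive(series, period):
    """Classical additive decomposition (a simple stand-in for STL).

    Returns (trend, seasonal, remainder); by construction the three
    components sum back to the input at every position.
    """
    n = len(series)
    if not 0 < period <= n:
        raise ValueError("period must be between 1 and the series length")
    half = period // 2
    trend = []
    for t in range(n):
        # Centered window, shrunk at the edges of the series.
        lo, hi = max(0, t - half), min(n, t + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    detrended = [y - tr for y, tr in zip(series, trend)]
    phase_means = []
    for phase in range(period):
        values = detrended[phase::period]
        phase_means.append(sum(values) / len(values))
    seasonal = [phase_means[t % period] for t in range(n)]
    remainder = [y - tr - se for y, tr, se in zip(series, trend, seasonal)]
    return trend, seasonal, remainder


def stack_channels(*channels):
    """Concatenate equal-length series along the channel axis."""
    length = len(channels[0])
    if any(len(channel) != length for channel in channels):
        raise ValueError("all channels must have the same length")
    return [list(channel) for channel in channels]


series = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0, 1.0, 3.0]
trend, seasonal, remainder = decompose_additive(series, period=2)
fused = stack_channels(series, seasonal, trend)
print(len(fused), len(fused[0]))  # 3 8
```

In the full model, the VSD, VMD and DPR series would be passed to the stacking step as additional channels alongside the seasonal and trend components.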

C. CONTEXT-AWARE CONVOLUTION
After the generation of context-aware vectors, we focus on temporal convolution to consider the local periodic information and accordingly to make more reasonable predictions. Context-aware convolution comprises three major steps.
Step 1 (Sensitivity Fusion): In the proposed model, we check whether it is required to compress channels and provide interactions between different channels by applying causal convolution before sensitivity fusion. This is because both the number and the length of channels corresponding to series would change once a convolution computation is applied. The aim is to realize end-to-end learning. As illustrated in Algorithm 1, applying causal convolution could ensure the consistency of channels, expanding the channel of a context-aware series that can be used for further training. The structure of the sensitivity fusion layer is presented in Figure 2.
Step 2 (Temporal Convolution): At this step, we enlarge the receptive field by stacking three dilated convolution layers with dilation equal to 1, 2, 4, ensuring that the longer periodic context awareness information is captured during the process. We aim to enable the network to predict based on different point weights, and therefore, we need to enhance the context importance through each layer. This can be achieved by fusing the sensitivity information with the current series before performing each dilated convolution. This step is illustrated in Algorithm 2.
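The growth of the receptive field under stacked dilated convolutions can be checked with a short Python sketch. The kernel size is not stated in this step, so the values used below are assumptions for illustration only.

```python
def receptive_field(kernel_size, dilations):
    """Number of input steps seen by one output of stacked causal
    dilated convolutions that share the same kernel size."""
    if kernel_size < 1:
        raise ValueError("kernel_size must be >= 1")
    return 1 + (kernel_size - 1) * sum(dilations)


# Dilations 1, 2, 4 as in Step 2; the kernel sizes are assumed values.
print(receptive_field(2, [1, 2, 4]))  # 8
print(receptive_field(3, [1, 2, 4]))  # 15
```

Each additional layer with a doubled dilation roughly doubles the covered history, which is why a few layers suffice to reach longer periodic contexts.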
Step 3 (Forecasting): To update the value of the series, the learned features are propagated through the dense layer at the last step. The original series is fused with prior knowledge through temporal convolution so that the points in the dense layer have different weights; therefore, the network can automatically bring the forecasting result close to the real values.

D. MODEL TRAINING
As illustrated in Figure 1 and Algorithm 2, dilated convolution is one of the key components of the proposed model. The dilated convolution operator can apply the same filter over different ranges using various dilation factors. Let d be a dilation factor; the operator $*_d$ is defined as
$$(F *_d k)(p) = \sum_{s + dt = p} F(s)\, k(t),$$
where $*_d$ represents a dilated convolution, or a d-dilated convolution; F is a discrete function; k is a discrete filter; p is the output position, whose receptive field consists of the inputs s satisfying s + dt = p. The proposed model then updates the weights to minimize the cost function by backpropagation. Suppose $\delta^{(l+1)}$ represents the error term for the (l + 1)-st layer in the network with a cost function J(W, b; x, y), where (W, b) are the parameters and (x, y) are the training data and ground-truth values. If the l-th layer is densely connected to the (l + 1)-st layer, then the error for the l-th layer is computed as
$$\delta^{(l)} = \left( (W^{(l)})^{T} \delta^{(l+1)} \right) \odot f'(z^{(l)}),$$
where f is the activation function and $z^{(l)}$ is the pre-activation input of the l-th layer, and the gradients are
$$\nabla_{W^{(l)}} J(W, b; x, y) = \delta^{(l+1)} (S^{(l)})^{T}, \qquad \nabla_{b^{(l)}} J(W, b; x, y) = \delta^{(l+1)}.$$
Finally, to calculate the gradient with respect to the filter maps, we rely on the border-handling convolution operation again and flip the error matrix $\delta_k^{(l+1)}$:
$$\nabla_{W_k^{(l)}} J(W, b; x, y) = \sum_{i} S_i^{(l)} * \mathrm{rot90}\left(\delta_k^{(l+1)}, 2\right),$$
where $S^{(l)}$ is the input to the l-th layer and the temporal convolution output of the (l − 1)-th layer; rot90 denotes rotation by ninety degrees, applied twice here. The operation $S_i^{(l)} * \delta_k^{(l+1)}$ is the "valid" convolution between the i-th input in the l-th layer and the error with respect to the k-th filter.
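A d-dilated causal convolution over a one-dimensional series can be sketched in plain Python as below. The sketch assumes zero values before the start of the series and a two-tap kernel of ones, both chosen only to make the receptive field visible; it is an illustration of the operator, not the trained layer of the model.

```python
def dilated_causal_conv(signal, kernel, dilation):
    """out[p] = sum over t of kernel[t] * signal[p - dilation * t].

    Inputs before the start of the series are treated as zero, so the
    output at p never depends on positions later than p (causality).
    """
    out = []
    for p in range(len(signal)):
        total = 0.0
        for t, weight in enumerate(kernel):
            s = p - dilation * t
            if s >= 0:
                total += weight * signal[s]
        out.append(total)
    return out


# Push a unit impulse through three layers with dilations 1, 2 and 4.
response = [1.0] + [0.0] * 9
for d in (1, 2, 4):
    response = dilated_causal_conv(response, [1.0, 1.0], d)
print(response)  # eight ones followed by two zeros
```

The impulse spreads over exactly eight positions, showing that an output of the third layer depends on eight consecutive inputs under these assumptions.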

IV. EXPERIMENTS
In this section, we mainly describe the setup of the conducted experiments and compare the performance of the proposed CATCN with several existing deep learning models that serve as baselines in traffic flow prediction.

A. DATASET DESCRIPTIONS
In the experiments, we use the following datasets to test the performance of the proposed model. To explicitly reveal the peculiarity of a traffic time series, we randomly extracted ten examples from all three datasets. First, we compute basic properties of each series, such as auto-correlation, mean change, mean second derivative central etc. Then, we cluster the series into five clusters, from each cluster we randomly select two examples and put all the examples together for plotting. In this way, we can make sure that the randomly selected subset of the data is representative enough.

2) SEATTLE-ILDD †
The data were collected using inductive loop detectors deployed on freeways in the Seattle area. The freeways include I-5, I-405, I-90, and SR-520. This dataset contains spatio-temporal information about the speed of the considered freeway system. The speed information at a milepost is averaged over the data from multiple loop detectors on the main lanes in the same direction. The dataset is aggregated into five-minute intervals with the dimension 5730 × 92. The random sampling results for this dataset are represented in Figure 3(b).
* https://github.com/liyaguang/DCRNN
† https://github.com/zhiyongc/Seattle-Loop-Data

3) METR-LA ‡
Metr-LA is a dataset comprising information from the Los Angeles highway. Specifically, the dataset contains traffic speed data registered over four months by 207 sensors deployed in the county. The dimension of this dataset is 1000 × 92 with a five-minute interval. It is separated into an observation group of 1000 × 80 for training and a forecasting group of 1000 × 12 for forecasting. The standard deviation of the dataset is 19.19, the mean value is 58.55, the minimum value is 0.00, and the maximum value is 70.00. According to the random sampling results represented in Figure 3(c), both sudden speed changes and static time series can be observed.
For each dataset, we run and evaluate all the methods ten times to eliminate outliers and then average the results to reduce random error. We apply Z-score normalization and split the dataset into a training set (70%) and a test set (30%) in chronological order, randomly during each run, so that the experiments are conducted in a rigorous and controlled environment and the results are generalizable.
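The normalization and split can be sketched in Python as follows. Two choices here are assumptions rather than details given above: the mean and standard deviation are fitted on the training part only and then reused for the test part, and the split point is deterministic, since the way randomness enters each run is not specified.

```python
import statistics


def chronological_split(series, train_percent=70):
    """Split a series into earlier (training) and later (test) parts."""
    cut = len(series) * train_percent // 100
    return series[:cut], series[cut:]


def fit_zscore(train):
    """Return the mean and population standard deviation of the training part."""
    mean = statistics.fmean(train)
    std = statistics.pstdev(train)
    if std == 0:
        raise ValueError("a constant series cannot be z-scored")
    return mean, std


def apply_zscore(values, mean, std):
    """Normalize values with statistics fitted elsewhere."""
    return [(value - mean) / std for value in values]


speeds = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
train, test = chronological_split(speeds)
mean, std = fit_zscore(train)
print(apply_zscore(test, mean, std))  # [2.0, 2.5, 3.0]
```

Fitting the statistics on the training part alone keeps information from the test period out of the inputs seen during training.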

B. COMPARISON WITH THE BASELINE METHODS
To prove the validity of the proposed approach, we compared it with four forecasting methods, which gives two groups: 1) neural network-based methods, including LSTM, Transformer, TCN, and WaveNet; 2) a neural network integrated with context awareness: CATCN (the proposed method).
‡ https://www.metro.net/

C. EVALUATION METRICS
To compare the performance and the effectiveness of the considered methods, we utilized the following metrics: mean absolute error (MAE), mean absolute percentage error (MAPE) and root mean square error (RMSE).
MAPE is used to measure the relative errors and is often reported as a percentage:
$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|,$$
where $y_i$ denotes the ground-truth value; $\hat{y}_i$ is the prediction output; n corresponds to the total length of the series. MAE is applied to measure the average absolute error between the predicted value and the ground-truth value and is calculated as follows:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|.$$
RMSE is employed to measure the deviation between the predicted and ground-truth values. RMSE was selected as it is deemed more sensitive to outliers:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}.$$
In the above equations, y denotes the ground-truth value and ŷ denotes the predicted value output by the network.
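The three metrics can be written directly in Python. In this sketch MAPE skips points whose ground-truth value is zero, since the relative error is undefined there; that handling is an assumption and is not described in the text.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent, skipping zero targets."""
    ratios = [abs((y - p) / y) for y, p in zip(y_true, y_pred) if y != 0]
    if not ratios:
        raise ValueError("MAPE is undefined when every target is zero")
    return 100.0 * sum(ratios) / len(ratios)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]
print(mae(y_true, y_pred), mape(y_true, y_pred), rmse(y_true, y_pred))
```

For this example the two errors of size 10 give an MAE of about 6.67 and an RMSE of about 8.16; the gap between the two grows as individual errors become more uneven, which reflects the sensitivity of RMSE to outliers.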

D. PARAMETER SETTINGS
In the considered benchmark models, we used the following parameter settings: LSTM [36]: hidden dimension d_h = 10 with one layer stacked; WaveNet [24]: residual channel 32 and skip channel 128, with K = 4 layers for each block and three blocks stacked in total. Parameters for the context awareness integrated model are provided in Table 1.

E. LOSS FUNCTION
To train the models through backpropagation and to measure the deviation between the prediction and the ground-truth values, we adopted RMSE as the loss function:
$$\mathcal{L} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2},$$
where $y_i$ denotes the ground-truth value; $\hat{y}_i$ is the prediction output; n corresponds to the total length of the series in question.
Table 2 provides the forecasting results averaged over ten runs on the three traffic datasets; the best results are highlighted in bold and the second-best results are underlined. The accuracy of all tested methods applied to the PEMS-BAY dataset with varying time length is represented in Figure 4, which illustrates the performance of the proposed method and the other four approaches as the time length is extended. As observed, except for WaveNet, the tested models exhibit increasing errors as the time length increases, and yet CATCN still outperforms the other compared methods in terms of the three metrics. We also examined the relationship between autocorrelation and model performance, as shown in Figure 5 and Figure 7. Low autocorrelation of a time series indicates that the sequence does not reflect typical periodic changes and tends to vary without exhibiting explicit recognizable patterns, which hinders sequence prediction from achieving satisfying results. We can observe that when autocorrelation is in [0, 0.6], MAPE on PEMS-BAY tends to have smaller variance and the outliers differ in short intervals. Figure 6 provides the visualization of the forecasting results on the three datasets. We can notice that after integrating the context-awareness features, the proposed method has a better capability of capturing the local patterns. Meanwhile, even though the trend of the forecasting result is generally consistent with the true value, the proposed CATCN still faces challenges when abrupt changes occur.

V. CONCLUSION
In the present paper, we proposed a novel deep learning architecture capable of extracting local features and combining prior knowledge with the original series to achieve better performance in traffic prediction compared with existing methods. It should be noted that the proposed network relies on a generic method; therefore, it not only achieves better results when applied to traffic series but is also expected to perform well in general time series forecasting. The proposed CATCN method can learn patterns of local fluctuation and enhance performance after extending the channels of a time series with periodic trends. End-to-end training is realized by integrating causal and dilated convolution, thereby improving the robustness of the proposed network.
The results of the conducted experiments indicate that the proposed CATCN achieves the best results compared with the considered alternative methods and also demonstrate that integrating traffic time series with local sensitivities allows useful information to be captured. Furthermore, the proposed method does not need to train attention weights, yet it still provides better capabilities than the method using attention mechanisms.

Since 2016, he has been an Assistant Professor of computer science with the Zhejiang University City College. He has extensive experience in big data system development, artificial intelligence applications, and software engineering management. His research interests include data mining and social network analysis, especially spatio-temporal data mining.
LEI XU received the master's degree in computer science from Zhejiang University, Hangzhou, China, in 2015. His research interests include system architecture, cloud computing, and so on.
SHENGLI ZHOU was born in Wenzhou, Zhejiang, China, in 1982. He received the Ph.D. degree from Army Engineering University, in 2018. He is currently an Engineer with the Zhejiang Police College. His research interests include network security.
ZHEN JIANG received the B.S. degree in intelligence science and technology from Central South University. He is currently pursuing the master's degree with Zhejiang University. His research interests include machine learning, deep graph learning, and so on.