NCAE and ELM Based Enhanced Ensemble Optimized Model for Traffic Flow Forecasting

Accurate and timely traffic flow forecasting is a key technology for addressing urban traffic congestion and is significant for intelligent transportation systems. However, developing an efficient, robust, and automatically generated forecasting model is a challenging task. In this paper, we propose a novel ensemble forecasting architecture based on the non-negativity-constrained sparse autoencoder and the extreme learning machine (NCAE-ELM-EA). The model uses NCAEs to extract traffic characteristics layer by layer through a greedy unsupervised algorithm, and the prediction network formed by connecting an NCAE to an ELM serves as the base learner of each lag pool. An adaptive enhanced integration algorithm is then employed to build an ensemble optimized forecasting model whose size is scalable and suited to each road segment. The model was applied to real data collected from the I5 NB highway in Portland, USA, and the J6-J7(N) freeway in the United Kingdom, achieving higher accuracy and lower labor costs than existing predictors.


I. INTRODUCTION
With the acceleration of urbanization and the sharp increase in the number of automobiles, increasingly serious traffic congestion and the resulting environmental pollution have become major factors affecting citizens' lives in many large cities. How to alleviate congestion, improve road utilization, and reduce fuel consumption and emissions are problems that Intelligent Transportation Systems (ITS) have been researching and solving [1]-[4]. Traffic flow forecasting estimates the traffic state at some future moment from current and past traffic conditions. Accurate and timely forecasting is not only a vital component of ITS [5], but also key to the successful deployment of current ITS. However, due to the inherent dynamic randomness and large variation of traffic flow, establishing an optimal model with high accuracy, strong robustness, and auto-buildability is a very challenging task [6], [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Qi Zhou.
In the past few decades, many researchers have published numerous articles in this field. In general, the techniques used for traffic flow forecasting can be divided into three categories: time series analysis methods, traditional machine learning methods, and deep learning methods.
Time series analysis methods mainly take the autoregressive moving average (ARMA) [8] as basic model, followed by the autoregressive integrated moving average(ARIMA) model [9] and the seasonal ARIMA model [10], [11]. In order to quantify the uncertainty of traffic flow sequences, GARCH was used to model conditional variance of traffic flow. Subsequently, other GARCH models, including multi-variable GARCH model [12] and asymmetric GARCH model [13], [14], were successively applied to traffic prediction and made considerable progress. However, with the deepening of people's understanding of the complex nonlinearity and nonstationarity of traffic flow, these linear methods have been unable to meet the needs.
With the continuous progress of technology, forecasting methods based on machine learning have flourished. Sun et al. [15] used the Gaussian mixture model to estimate the joint probability distribution to predict traffic flow. Fei et al. [16] proposed a dynamic linear model (DLM) based on Bayesian inference to predict travel time. Zheng and Su [17] applied the k-Nearest Neighbor (KNN) method to forecast traffic volume. The support vector machine (SVM) [18], [19] has gradually drawn attention in this field due to its outstanding nonlinear modeling ability. However, although traditional machine learning approaches have a solid mathematical foundation that can help us understand the mechanism of data generation, they lack the ability to process high-dimensional data [20], especially when the regularity of traffic data is time-varying. It is difficult for them to describe complex nonlinear changes, so the improvement in prediction accuracy is limited.
In recent years, deep learning models with strong multidimensional nonlinear data processing ability have attracted more and more researchers, who apply them to traffic data mining. Guo et al. [7] developed a technique for short-term urban traffic flow forecasting based on the convolutional neural network (CNN). In the literature [21], the LSTM NN is introduced to capture nonlinear dynamic characteristics effectively. The LSTM NN solves the problem of back-propagation error attenuation by means of a memory module and outperforms ordinary NNs. Yang et al. [22] used a sparse autoencoder (SAE) to extract features from traffic flow and obtained a robust prediction model. Generally, it is hard for a single model to consistently perform well in diverse and complex scenarios, resulting in poor generalization. Zhou et al. [6] introduced the δ-agree strategy to train the boosting module of the stacked SAE, which constituted the ensemble model of δ-agree AdaBoost regression and achieved superior performance in highway traffic flow forecasting.
According to the above discussions, due to the dynamic and complex nature of traffic flow, it is a promising research direction to combine advanced models that can extract hidden features from data with integration strategies. Moreover, existing studies on traffic prediction have not completely solved the problem of choosing appropriate input variables [23] and how to establish a unified framework of the best model flexibly and automatically. A novel idea is to use a set of networks with different lags to learn the characteristics of data from various aspects. In the stage of integration, it automatically selects accurate and diverse base predictors for each data set to ensemble. By means of enhanced integration, the selection of input variables is unified with the selection of base learners, which is more conducive to obtaining an excellent ensemble model under a consistent architecture.
In this paper, the NCAE-ELM-EA adaptive enhancement forecasting model is proposed, which reduces the labor cost of redesigning models for different road sections, and improves the generalization ability and accuracy of the model. The principal contributions of this work lie in the following four aspects.

1) NCAEs with different time lags are used as basic modules to extract the characteristics from historical data.
2) The Taguchi method is applied to design the parameters of the NCAEs, so as to obtain the optimal parameter configuration corresponding to each lag value.
3) Features extracted by the NCAE with the optimal structure are fed to the predictor ELM; the NCAE connected to the ELM forms a prediction network that serves as the base learner of each lag pool.
4) Using the enhanced integration algorithm, the base learners are selectively integrated to generate the optimal ensemble prediction model.
The rest of this paper is organized as follows. Section II introduces the NCAE network, the Taguchi method, and the adaptive enhanced integration algorithm. Section III presents the NCAE-ELM-EA model architecture. Section IV describes the design factors of the experiments and analyzes the experimental results and the performance of the model. Section V concludes the paper.

II. METHODOLOGIES
The best ensemble forecasting model that can be automatically generated is designed, trained, and integrated by Taguchi method, NCAE network, ELM, and adaptive enhancement algorithm. The schematic flow of the model is shown in Fig. 1.

A. NON-NEGATIVE CONSTRAINT AUTOENCODER(NCAE)
A sparse autoencoder (SAE) [24], [25] is essentially a neural network whose number of input nodes equals its number of output nodes. By reconstructing the input at the output layer, the codes of the input data, that is, the features, are obtained in the hidden layer. The cost function $J_{SAE}$ comprises the average reconstruction error between all inputs and outputs, a weight decay term, and a sparse penalty term,
$$J_{SAE}=\frac{1}{2M}\sum_{i=1}^{M}\left\|\hat{x}^{(i)}-x^{(i)}\right\|^{2}+\frac{\lambda}{2}\sum_{i,j}f\!\left(w_{ij}\right)+\beta\sum_{j=1}^{h}\mathrm{KL}\!\left(\rho\,\middle\|\,\hat{\rho}_{j}\right),\qquad f(w)=\begin{cases}w^{2}, & w<0,\\ 0, & w\geq 0,\end{cases}$$
where $M$ is the number of training samples and $h$ is the number of hidden units.
Here $x$ is the input vector and $\hat{x}$ is the reconstructed vector at the output layer; $\lambda$ controls the weight decay term and $\beta$ controls the sparse penalty term. The weight decay term imposes the non-negativity constraint on the connecting weights by penalizing only the negative weights.
The sparse penalty term is based on the concept of KL divergence,
$$\mathrm{KL}\!\left(\rho\,\middle\|\,\hat{\rho}_{j}\right)=\rho\log\frac{\rho}{\hat{\rho}_{j}}+\left(1-\rho\right)\log\frac{1-\rho}{1-\hat{\rho}_{j}}.$$
The sparse penalty term is added to ensure that a useful data representation can be extracted in the hidden layer even when the number of hidden units is greater than or equal to the number of input units. The KL divergence forces the neurons to be inactive most of the time, so that the hidden layer responds only to a small number of samples of the input data, which increases the sparsity of the hidden-layer coding and yields a simple and effective representation of the input data.
The average activation $\hat{\rho}_{j}$ of hidden neuron $j$ is defined as
$$\hat{\rho}_{j}=\frac{1}{M}\sum_{i=1}^{M}a_{j}\!\left(x^{(i)}\right),$$
where $a_{j}$ and $b_{j}$ are the activation and bias of hidden neuron $j$, respectively.
The KL divergence measures the difference between $\hat{\rho}_{j}$ and $\rho$, and ensures sparsity by forcing $\hat{\rho}_{j}=\rho$.
The training of the NCAE aims at minimizing the cost function $J_{SAE}$ to optimize the weights and biases using the back-propagation algorithm. After a period of iterative training, the latent patterns hidden in the high-dimensional training data are fully extracted in the hidden layer.
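As an illustration of the cost function described above, the following minimal NumPy sketch computes it for one encode/decode pass. This is not the authors' code: the sigmoid activations, batch layout, and the piecewise quadratic decay on negative weights are our assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ncae_cost(X, W1, b1, W2, b2, lam=3e-2, beta=3.0, rho=0.05):
    """Cost of a non-negativity-constrained sparse autoencoder.

    X: (M, n) batch of inputs; W1/b1 encode, W2/b2 decode.
    The weight-decay term penalizes only negative weights,
    which pushes the learned weights toward non-negativity.
    """
    M = X.shape[0]
    A = sigmoid(X @ W1 + b1)          # hidden activations (M, h)
    X_hat = sigmoid(A @ W2 + b2)      # reconstruction (M, n)

    recon = 0.5 / M * np.sum((X_hat - X) ** 2)

    # Non-negativity constraint: decay only the negative weights.
    neg = lambda W: np.sum(np.minimum(W, 0.0) ** 2)
    decay = 0.5 * lam * (neg(W1) + neg(W2))

    # KL-divergence sparsity penalty on mean hidden activations.
    rho_hat = np.clip(A.mean(axis=0), 1e-8, 1 - 1e-8)
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

    return recon + decay + beta * kl
```

Minimizing this quantity with back-propagation (e.g., any gradient-based optimizer) would correspond to the training procedure described above.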

B. EXTREME LEARNING MACHINE
For the common feedforward neural network (FNN), the most frequently used training method is to adjust the weights and biases through the gradient-descent-based BP algorithm, so as to minimize the cost function over successive iterations and approach the ideal output. However, gradient-based methods are prone to falling into local minima, excessive training leads to poor generalization, and when the number of hidden nodes is large, training is time-consuming and convergence is slow. To address these issues, Huang et al. [26] proposed the Extreme Learning Machine (ELM) in 2005. In this method, the input weights and hidden-layer biases are given randomly, and the output weights are calculated via the generalized inverse of the hidden-layer output matrix.

Algorithm 1 ELM Algorithm
Input: training set $\{(x_i, t_i)\}_{i=1}^{M}$, number of hidden nodes $N$.
Step 1: Randomly assign the input weights $w_j$ and hidden-layer biases $b_j$ ($j = 1, \cdots, N$); the $M$ equations of the loss-minimization problem can then be written compactly as $H\beta = T$.
Step 2: Calculate the hidden-layer output matrix $H$.
Step 3: Calculate the output weights $\hat{\beta} = H^{\dagger}T$, where $H^{\dagger}$ is the Moore-Penrose generalized inverse of $H$.
Output: $\hat{\beta}$.
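The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative implementation under our own naming, using a sigmoid hidden layer and `np.linalg.pinv` for the Moore-Penrose inverse:

```python
import numpy as np

def train_elm(X, T, n_hidden=50, seed=0):
    """Train a single-hidden-layer ELM.

    The input weights w and hidden biases b are drawn at random and
    never updated; only the output weights beta are solved for, via
    the Moore-Penrose pseudoinverse of the hidden-layer output
    matrix H (beta = pinv(H) @ T).
    """
    rng = np.random.default_rng(seed)
    w = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid hidden outputs
    beta = np.linalg.pinv(H) @ T
    return w, b, beta

def predict_elm(X, w, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return H @ beta
```

Because no iterative gradient descent is involved, training reduces to a single matrix factorization, which is why ELM avoids the local-minima and slow-convergence issues noted above.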

C. TAGUCHI METHOD AND PARAMETER SETTINGS OF NCAE
In general, time series forecasting captures potential features from past observations of the series, so an appropriate number of past values, termed the lag, is the most critical part of the prediction [27]. If the lag is small, important information may be missed; if the lag is large, useless inputs may introduce noise. Neither condition is conducive to better prediction [28], [29]. Furthermore, the optimal lag generally differs between data sets, and for the same data set, the potential data patterns learned by networks with different lags also vary. It is clearly unreasonable to consider only the result corresponding to the optimal lag and ignore the contribution of other lags to the final prediction. Therefore, we need to adaptively select a set of appropriate lags for each data set and then integrate the characteristics learned by the different networks to obtain an optimal result. In addition, the number of hidden nodes is closely correlated with the number of input variables, especially for an autoencoder: it directly affects the dimension of the mapped feature space, which in turn affects whether the extracted representation of the input data is correct and sufficient. When an NCAE is trained, the effect of the sparsity penalty term, the degree of sparsity, and the weight constraint are all related to the feature extraction of the NCAE. Therefore, how these design parameters are set and how their correlations are analyzed will determine the performance of the model.
In this paper, the Taguchi method is used to select the levels of the NCAE's parameters. The Taguchi method [30] is essentially a series of orthogonal experiments in which the parameters are configured by an orthogonal array. Through these representative experiments, the effect of each factor's levels is analyzed to determine the optimal combination of design parameters. This method not only avoids the complex parameter adjustment and heavy workload of the trial-and-error method, but also separates the effect of each design factor through the orthogonal combination of parameters, revealing the contribution of each factor to the model more effectively. For instance, suppose each lag pool has five design factors with five levels each. The trial-and-error method requires 5^5 = 3125 experiments, whereas an orthogonal array L25(5^5) needs only 25 experiments to evaluate the influence of each factor and determine the optimum condition, saving 3100 trials (3125 - 25). Since the NCAE-ELM-EA has 10 lag pools, a total of 31,000 (10 × 3100) experiments are saved. This demonstrates the effectiveness of the Taguchi method in the design of the model.
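The factor-effect analysis that the Taguchi method enables can be sketched as follows. This is illustrative code; the array layout and function name are our own, and the orthogonal array itself is assumed to be supplied (e.g., a standard L25(5^5) table):

```python
import numpy as np

def factor_effects(levels, mape):
    """Average response (MAPE) of each factor at each level.

    levels: (n_trials, n_factors) integer level index per trial,
            as given by an orthogonal array such as L25(5^5).
    mape:   (n_trials,) observed MAPE of each trial.
    Returns a dict {factor: {level: mean MAPE}}; the best level of a
    factor is the one with the lowest mean response.
    """
    effects = {}
    for f in range(levels.shape[1]):
        effects[f] = {lv: mape[levels[:, f] == lv].mean()
                      for lv in np.unique(levels[:, f])}
    return effects
```

Because the design is orthogonal, each factor appears the same number of times at each level, so these per-level averages isolate that factor's contribution.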

D. ADAPTIVE ENHANCED ENSEMBLE
Ensemble learning has been proved to be an effective way to improve accuracy or decompose complex problems into easier subproblems [31]. Ensemble learning [32], [33] is a general learning strategy rather than a specific learning algorithm: it integrates the results of multiple complementary basic models by certain rules to obtain better predictions than any single model. Among existing ensemble models for traffic flow forecasting, one approach selects the best parameters through the different layers of the ensemble model; once the optimized model is determined, the other, poorer-performing models are discarded [27]. The other approach artificially determines a set of appropriate lags through abundant experiments to train the base learners and then integrates the predictions of these models [34]. Both methods have flaws. On the one hand, although the abandoned sub-optimal models perform worse than the best one, they have still learned the underlying data features at different levels and provide useful information about the future traffic condition; it is unreasonable to discard them outright. On the other hand, determining a group of base learners from experience is illogical: it cannot fully capture the patterns of the training samples and leaves the model lacking flexibility, thus limiting further improvement of the prediction performance.
In this paper, the model not only makes predictions based on the features extracted by the NCAEs, but also focuses on the accuracy of the base learners. At the integration stage, an adaptive enhancement strategy is adopted to verify whether adding each lag pool's network makes a contribution, so that the accuracy of the ensemble output is improved and the flexibility to deal with different data sets is increased.
Therefore, a series of lags is set to constitute the basic lag pools for integration. In each lag pool, the Taguchi method is used to design a set of parameter levels to configure different base learners. The bagging method is then used to further ensure diversity among the base learners in each lag pool. Finally, the adaptive enhanced ensemble algorithm (Algorithm 2) is applied to generate the ensemble prediction model.

Algorithm 2 Adaptive Enhanced Ensemble Algorithm
Input: the optimal base learner of each lag pool and its MAPE on the validation set.
Step 1: Arrange all base learners in ascending order of their MAPE.
Step 2: Integrate the top two base learners to obtain the first ensemble model.
Step 3: Add each remaining base learner in the sequence to the ensemble model one by one; according to the prediction performance of the model after each integration operation, determine the best ensemble model.
Output: the optimal ensemble prediction model.
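A minimal sketch of Algorithm 2, assuming the ensemble combines base learners by simple averaging (the paper does not spell out the combination rule, so that choice is ours):

```python
import numpy as np

def mape(y, y_hat):
    """Mean absolute percentage error, in percent."""
    return float(np.mean(np.abs((y - y_hat) / y)) * 100)

def enhanced_ensemble(preds, y_valid):
    """Adaptive enhanced ensemble selection (after Algorithm 2).

    preds: list of validation-set prediction vectors, one per base
    learner.  Learners are sorted by MAPE (Step 1); starting from the
    top two (Step 2), ensembles of growing size are formed along the
    sorted order, and the size with the lowest validation MAPE is
    kept (Step 3).  Predictions are combined by simple averaging.
    """
    order = sorted(range(len(preds)), key=lambda i: mape(y_valid, preds[i]))
    best_k, best_err = 2, float("inf")
    for k in range(2, len(order) + 1):
        ens = np.mean([preds[i] for i in order[:k]], axis=0)
        err = mape(y_valid, ens)
        if err < best_err:
            best_k, best_err = k, err
    return order[:best_k], best_err
```

The returned index list is the scalable ensemble: its size depends on how many ordered base learners actually improve the validation error.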

III. NCAE-ELM-EA FORECASTING MODEL
This paper proposes an enhanced ensemble architecture prediction model (NCAE-ELM-EA) based on stacked NCAEs and the ELM. The major steps of the model are depicted in Fig. 2.
The model is mainly composed of three parts: (1) The preprocessing module: a Hampel filter is used to process the missing values and outliers of the original data sequences, and the data series is then divided into samples belonging to different lags. (2) The training module: in each lag pool, it trains the NCAEs and their subsequently connected regression layer and analyzes the relationship between the factors' levels and the prediction performance, thus determining the optimal NCAE structure. The selected NCAE with the optimal structure is connected to the ELM to constitute a network, called the NCAE-ELM prediction network, which serves as the base learner of each lag pool. (3) The adaptive enhancement integration module, which uses the enhanced integration algorithm to selectively integrate the best base learners of each lag pool, forming an ensemble prediction model with a scalable model size.
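As an illustration of the preprocessing step, a basic Hampel filter can be sketched as follows. This is our own implementation with the conventional 3-sigma threshold and the 1.4826 MAD scale factor; the window size is an assumption:

```python
import numpy as np

def hampel_filter(x, window=7, n_sigmas=3):
    """Hampel filter: replace outliers with the rolling median.

    A point is an outlier if it deviates from the median of its
    window by more than n_sigmas * 1.4826 * MAD, where MAD is the
    median absolute deviation within the window.
    """
    x = np.asarray(x, dtype=float)
    y = x.copy()
    k = window // 2
    for i in range(len(x)):
        lo, hi = max(0, i - k), min(len(x), i + k + 1)
        med = np.median(x[lo:hi])
        mad = 1.4826 * np.median(np.abs(x[lo:hi] - med))
        if mad > 0 and abs(x[i] - med) > n_sigmas * mad:
            y[i] = med
    return y
```

Missing values can be handled the same way by first marking them as extreme placeholders or by interpolating before filtering; the sketch only covers outlier replacement.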

IV. EXPERIMENTAL DESIGN AND PERFORMANCE EVALUATION A. DATA DESCRIPTION
The traffic data used in this paper were collected from the I5 NB freeway in Portland, USA [34] and the J6-J7(N) segment of the M6 highway in the U.K. [22], [23]. (I5 NB road: http://portal.its.pdx.edu/. J6-J7(N) road: http://data.gov.uk/data/search). We use 15-minute aggregated data (vehs/15 min); the collected traffic patterns and the sampling interval of the data are well suited to short-term traffic flow forecasting.
In the experiments, we chose data from I5 NB on the working days between September 11, 2015 and October 5, 2015 (17 working days) as one data set. The first 13 weekdays are grouped as the training set, and the last 4 weekdays are treated as the test set. For J6-J7(N), we use the data between March 7, 2016 and March 31, 2016 (19 working days). The first 15 weekdays serve as the training set, and the remaining 4 days are used as the test set. We discard the data belonging to weekends because traffic states on weekends differ from those on weekdays. Since we adopt 10-fold cross-validation in the training process, the validation set is composed of 10% of the samples selected randomly from the training set.
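Constructing the lagged input samples used throughout the experiments can be sketched as follows (an illustrative helper under our own naming; each sample uses `lag` past values to predict the next point):

```python
import numpy as np

def make_lagged_samples(series, lag):
    """Slice a flow series into (input, target) pairs for a given lag.

    Each sample uses `lag` consecutive past values to predict the
    next one, so a series of length n yields n - lag samples.
    """
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i:i + lag] for i in range(len(series) - lag)])
    y = series[lag:]
    return X, y
```

Applying this helper once per lag value would produce the per-lag sample sets that feed the different lag pools.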

B. PARAMETER SETTING
There are six design parameters of the NCAE network; five parameters of each lag pool are designed using the Taguchi method. The design factors and their corresponding levels are shown in TABLE 1.
1) Design factor i: N1, the number of hidden nodes in the first-layer NCAE, determined according to the selection rule for the hidden nodes of an MLP: the candidate levels are log2 N, 2 × log2 N, . . . , up to a maximum of 50, where N is the number of training samples [35], [36], as shown in Table 1.
2) Design factor ii: N2, the number of hidden nodes of the second-layer NCAE.
3) Design factor iii: λ, the coefficient of the L2 weight regularizer in the cost function.
4) Design factor iv: β, the coefficient of the sparsity regularization term.
5) Design factor v: the sparsity proportion ρ, the desired proportion of training examples a neuron reacts to; it is a parameter of the sparsity regularizer and controls the sparsity of the hidden-layer output.

C. PERFORMANCE ANALYSIS
To evaluate the accuracy and efficiency of the stacked NCAE-ELM-EA model proposed in this paper, its performance was evaluated by two commonly used metrics, the mean absolute percentage error (MAPE) and the variance of the absolute percentage error (VAPE),
$$\mathrm{MAPE}=\frac{1}{m}\sum_{i=1}^{m}\frac{\left|y_{i}-\hat{y}_{i}\right|}{y_{i}}\times 100\%,\qquad \mathrm{VAPE}=\mathrm{Var}\!\left(\frac{\left|y_{i}-\hat{y}_{i}\right|}{y_{i}}\right)\times 100\%,$$
where $y_{i}$ is the observed traffic flow, $\hat{y}_{i}$ is the predicted traffic flow, and $m$ is the number of forecasted points.

An orthogonal array L25(5^5) is used for the trial design. Based on the combinations of the levels of the design factors, 25 main trials are conducted to obtain the optimized structure of each lag pool; each main trial corresponds to a row of the adopted orthogonal array. On the two traffic data sets (J6-J7(N), I5 NB), the results of the 25 main trials of the base learners in the pool with lag 12 are shown in Table 2. As shown in Table 2, on the J6-J7(N) road, the 10th main trial, with 12 input nodes (i.e., the lag), 20 hidden nodes for the NCAE, and 3e-2, 1, and 0.05 as the regularizer weight, sparsity term coefficient, and sparsity parameter, respectively, achieves the smallest MAPE (8.31). In contrast, on the I5 NB road, when a single NCAE is connected to the regression layer, the 22nd main trial obtains the smallest MAPE (22.37) and VAPE (49.26); when two stacked NCAE layers are connected to the regression layer, the 14th main trial, with 30 hidden nodes in both NCAEs and 3e-2, 3, and 0.3 as the regularizer weight, sparsity term coefficient, and sparsity parameter, respectively, achieves the smallest MAPE (22.29).

According to the above discussion, within the same lag pool, base learners with different structures achieve different prediction accuracies on the same data set, and base learners with the same structure also perform very differently across data sets. If the model can be generated adaptively according to the data set, it can not only improve the forecasting accuracy but also reduce the modeling workload and human intervention and improve the universality of the model.
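The two metrics can be computed as in the sketch below. We assume VAPE is the variance of the absolute percentage errors expressed in percent, matching the MAPE scaling:

```python
import numpy as np

def mape(y, y_hat):
    """Mean absolute percentage error, in percent."""
    return float(np.mean(np.abs((y - y_hat) / y)) * 100)

def vape(y, y_hat):
    """Variance of the absolute percentage errors, in percent.

    Assumed definition: variance of the per-point absolute
    percentage errors, each expressed in percent.
    """
    return float(np.var(np.abs((y - y_hat) / y) * 100))
```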
The effects of the design factors can be separated because the combinations of design factors across trials are orthogonal. The effect of each design factor at each of the five levels is calculated by averaging the results in Table 2 for that factor at the given level. For example, level 3 of design factor i appears in the 11th, 12th, 13th, 14th, and 15th main trials; on the J6-J7(N) segment, when the network is one NCAE layer plus the regression layer, the average for this level is 8.57% (MAPE) and 2.02% (VAPE). Lower MAPE and VAPE indicate better performance. On the I5 NB road, with lag 12, the effects of the design factors at the five levels for the two kinds of network (first-layer NCAE or second-layer NCAE connected to the regression layer) are shown in Fig. 3.
Based on the above analysis, we know which level of each parameter yields the best effect and can thus obtain the optimized parameter combination of the base learners in each lag pool. The optimal parameter levels on the two data sets are shown in Tables 3 and 4.
It can be seen from Tables 3 and 4 that the parameters of the selected optimal base learners vary greatly between data sets. As a result, the NCAEs with different structures in the base learners can extract diverse characteristics of the input data and mine potential traits of the traffic flow sequence. On the J6-J7(N) and I5 NB data sets, the base learners that constitute the best ensemble prediction model are shown in Table 5.
For the NCAEs of the base learners that constitute the best NCAE-ELM-EA forecasting model, the grayscale images of the features extracted from the training and test sets of the J6-J7(N) and I5 NB data sets are shown in Figure 4. For the grayscale representation, the data are normalized to [0,1], where 0 represents black and 1 represents white. Each subgraph in Figure 4 shows the features obtained by the NCAE of a certain base learner; the rows correspond to the hidden nodes of the NCAE, namely the dimensions of the features, and each column denotes a sample. From the grayscale maps, we can see that the extracted features show obvious daily periodicity. With different lag values, the characteristics obtained follow similar rules but differ in detail.
In Figure 5, we choose the base learners ranked first and third on the I5 NB data set and unfold their features on a daily basis. It can be clearly seen from (C) and (D) that different NCAE structures acquire different features on the same test set, even though both are stacked two-layer NCAE networks. When the lag value is smaller, the granularity of the features' grayscale image is larger during peak hours; when the lag value is larger, the granularity is larger during the morning and evening off-peak hours. Networks with different lags capture the short-term and long-term trends of traffic flow, and these diverse features are the basis for improving the prediction accuracy of the integrated model. This paper adopts an adaptive enhanced integration strategy to establish a unified model, NCAE-ELM-EA, that can adaptively determine the best network structure for a given data set. When the ensemble model selects the individuals to be integrated, it arranges the prepared base learners according to their performance and then integrates them into the model one by one. On the J6-J7(N) and I5 NB data sets, the prediction results of the best ensemble prediction model on the test set are shown in Figure 6.
On the different data sets, although the prediction accuracy of the base learners varies and some of them even perform poorly, the model obtained by integrating these base learners with the adaptive enhancement strategy achieves better prediction performance. NCAE-ELM-EA synthesizes the advantages of networks with different structures: by taking various lag values, the corresponding optimal network fully extracts data features from multiple time-correlation angles and grasps the data regularities, so as to improve the prediction accuracy. For different data sets, the ensemble model selects different individuals to integrate, but the selected networks are all chosen based on their performance during training. This approach makes the model more flexible and robust, which is most evident on the I5 NB data set with its frequent and drastic fluctuations. Moreover, the size of the ensemble model is adaptive and scalable: the accuracy of each base learner fluctuates slightly across training runs, leading to differences in performance ranking and in contribution to the final prediction, but the final prediction remains highly accurate and stable.
By comparing five prediction models, the proposed NCAE-ELM-EA, MLP, E-ELM [28], LSTM [22], and EnLSSVR [31], the accuracy and effectiveness of the networks based on the NCAE with the best structure and of the enhanced integration method in traffic flow prediction are demonstrated. The parameter settings of the competing models are as follows: the MLP has two hidden layers, whose numbers of input and hidden nodes are selected from the parameter settings in this paper; some shared parameters of the E-ELM model follow the settings of [25], and the remaining parameters are determined by training on the experimental data sets of this paper; the time delay of LSTM is likewise determined from the lag set adapted to each data set in this paper; EnLSSVR is an ensemble model whose parameters are set according to [34].
The predicted results of all methods on the test sets of each section are shown in Figure 7 and Figure 8. On all data sets, the evaluation tests of the NCAE-ELM-EA model and the comparative prediction models were repeated 10 times, and the performance results are shown in Table 6.
From these results, we can find that, on the J6-J7(N) road section, the performance of the NCAE-ELM-EA model proposed in this paper is very close to that of E-ELM and MLP, with high accuracy and small error fluctuation. The MAPE of NCAE-ELM-EA was only 7.58%, which was 2.3% and 0.86% lower than those of LSTM and EnLSSVR, respectively. On the I5 NB road section, the MAPE of our model decreased by 2%, 5.35%, 3.82%, and 2.46% relative to the comparison models, and its error fluctuation (VAPE) was also the smallest. On the I5 NB road section, the E-ELM model, which performs excellently on the J6-J7(N) section, is significantly worse than the NCAE-ELM-EA, which shows that NCAE-ELM-EA not only has high prediction accuracy but also good robustness. When faced with road data exhibiting frequent and violent fluctuations, the NCAE-ELM-EA model better grasps the trend of the data sequence and more accurately captures the characteristics of the fluctuating segments. Compared with the other models, it adapts flexibly to different data, showing the advantages of enhanced integration.

V. CONCLUSION
In this paper, we proposed a novel model for short-term traffic flow forecasting. To the best of our knowledge, this is the first work to use the non-negativity-constrained autoencoder as the base learner; the Taguchi method is used to design and obtain the optimized basic networks in each lag pool, and the ensemble model is constructed through the enhanced integration algorithm. The NCAE-ELM-EA model is applied to real traffic flow scenarios. The experimental results show that the model can be applied flexibly to different traffic flow data sets and achieves better prediction performance than other widely used models.