Combining LSTM Network Ensemble via Adaptive Weighting for Improved Time Series Forecasting

Time series forecasting is essential for engineering applications in fields such as finance, geology, and information technology. Long Short-Term Memory (LSTM) networks are gaining renewed interest and are replacing many practical implementations of time series forecasting systems. This paper presents a novel LSTM ensemble forecasting algorithm that effectively combines multiple forecast (prediction) results from a set of individual LSTM networks. The main advantages of our LSTM ensemble method over other state-of-the-art ensemble techniques are summarized as follows: (1) we develop a novel way of dynamically adjusting the combining weights that are used for combining multiple LSTM models to produce the composite prediction output; to this end, our method updates the combining weights at each time step in an adaptive and recursive way, using both past prediction errors and a forgetting factor; (2) our method captures nonlinear statistical properties in the time series well, which considerably improves forecasting accuracy; (3) our method is straightforward to implement and computationally efficient at runtime because it does not require complex optimization to find the combining weights. Comparative experiments demonstrate that our proposed LSTM ensemble method achieves state-of-the-art forecasting performance on four publicly available real-life time series datasets.


Introduction
A time series is a set of values arranged in chronological order by a common index. The objective of time series forecasting is to estimate the next value of a sequence, given a number of previously observed values. To this end, forecast (prediction) models are needed to predict the future based on historical data [1]. Traditional mathematical (statistical) models, such as Least Square Regression (LSR) [2], Autoregressive Moving Average models [3][4][5], and Neural Networks [6], have been widely used and reported in the literature for their utility in practical time series forecasting.
Time series forecasting has fundamental importance in numerous practical engineering fields such as energy, finance, geology, and information technology [7][8][9][10][11][12]. For instance, forecasting of electricity consumption is of great importance in deregulated electricity markets for all of the stakeholders: energy wholesalers, traders, retailers, and consumers [10].
The ability to accurately forecast future electricity consumption allows them to perform effective planning and efficient operations, leading to ultimate financial profits. Moreover, energy-related time series forecasting plays an important role in the planning and operation of the power grid system [7,8]; for instance, an accurate and stable wind speed forecast is of primary importance in the wind power industry and influences power-system management and the stability of market economics [11].
However, time series forecasting in the aforementioned applications is an inherently challenging problem due to the dynamic and nonstationary characteristics of the data [1,6,13,14]. Additionally, any data volatility leads to increased forecasting instability. To overcome these challenges, there is a growing consensus that ensemble forecasting [3-6, 13, 14], i.e., forecasting model combination, has an advantage over using a single individual model in terms of enhancing forecasting accuracy. The most common approach to ensemble forecasting is simple averaging, which assigns equal weights to all forecasting component models [3][4][5]. The simple averaging approach is sensitive to extreme values (i.e., outliers) and unreliable for skewed distributions [6,14]. To cope with this limitation, weighted combination schemes have been proposed. The authors in [2] proposed Least Square Regression (LSR), which attempts to find the optimal weights by minimizing the Sum of Squared Errors (SSE). The authors in [15] adopted the Average of In-sample Weights (AIW) scheme, where each weight is simply computed as the normalized inverse absolute forecasting error of an individual model [16]. The authors in [6] developed the so-called Neural Network Based Linear Ensemble (NNLE) method, which determines the combining weights through a neural network structure.
Recently, a class of mathematical models called Recurrent Neural Networks (RNNs) [17] has been gaining renewed interest among researchers, and these models are replacing many practical implementations of forecasting systems previously based on statistical models [1]. In particular, Long Short-Term Memory (LSTM) networks, a variant of RNNs, have proven to be among the most powerful RNN models for time series forecasting and related applications [1,15]. LSTM networks can be constructed in such a way that they are able to remember long-term relationships in the data, and they have been shown to model temporal sequences and their long-range dependencies more accurately than the original RNN model [1]. However, despite the recent popularity of LSTM networks, their applicability in the context of ensemble forecasting has not yet been investigated. Hence, to our knowledge, how to best combine multiple forecast results of individual LSTM models remains a challenging and open question.
In this paper, we present a novel LSTM ensemble method for improved time series forecasting, which effectively combines multiple forecasts (predictions; throughout the remainder of this paper, the terms "forecast" and "prediction" are used interchangeably) inferred from different and diverse LSTM models. In particular, we develop a novel way of dynamically adjusting the so-called combining weights that are used for combining multiple LSTM models to produce the composite prediction output. The main idea of our proposed method is to update the combining weights at each time step in an adaptive and recursive way. For this, the weights are determined using both past prediction errors (measured up to the current time step) and a forgetting factor. The weights are assigned to individual LSTM models, which improves the forecasting accuracy to a large extent. The overall framework of our proposed method is illustrated in Figure 1. Results show that the proposed LSTM ensemble achieves state-of-the-art forecasting performance on publicly available real-world time series datasets and is considerably better than other recently developed ensemble forecasting methods, as will be shown in Section 4.
The rest of this paper is organized as follows. Section 2 describes our proposed approach for building an ensemble of LSTMs that is well-suited for time series forecasting. Section 3 then discusses how to find the combining weights used to adaptively combine the LSTM models. Section 4 presents and discusses our comparative experimental results. Finally, the conclusion is given in Section 5.

Building LSTM Ensemble for Time Series Forecasting
In the proposed method, an ensemble of LSTM networks should first be constructed in a way that maximizes the complementary effect obtained when combining multiple LSTM forecast results, with the aim of improving forecasting accuracy. Before explaining our LSTM ensemble construction, we present a brief review of the LSTM network for the sake of completeness. LSTM networks and their variants have been successfully applied to time series forecasting [1]. LSTM networks take sequential data as input, which without loss of generality means data samples that change over time. Input into LSTM networks involves a so-called sequence length parameter (i.e., the number of time steps) that is defined by the sample values over a finite time window [19]. Thus, the sequence length is how we represent the change in the input vector over time; this is the time series aspect of the input data. The architecture of LSTM networks is generally composed of units called memory blocks. A memory block contains memory cells and three controlling gates, i.e., an input gate, a forget gate, and an output gate. The memory cell is designed to preserve the knowledge of previous steps through self-connections that can store (remember) the temporal state of the network, while the gates control the update and flow direction of information.
For time series prediction with input x_t, the LSTM updates the memory cell c_t and outputs a hidden state h_t according to the following calculations, which are performed at each time step t. The equations below give the full mechanism for a modern LSTM with forget gates [15]:

  i_t = σ(W_{xi} x_t + W_{hi} h_{t−1} + b_i)
  f_t = σ(W_{xf} x_t + W_{hf} h_{t−1} + b_f)
  o_t = σ(W_{xo} x_t + W_{ho} h_{t−1} + b_o)          (1)
  c_t = f_t ⊗ c_{t−1} + i_t ⊗ φ(W_{xc} x_t + W_{hc} h_{t−1} + b_c)
  h_t = o_t ⊗ φ(c_t)

In (1), W_{mn} denotes the weight of the connection from unit m to gate n and b_n is a bias parameter to learn, where m ∈ {x, h} and n ∈ {i, f, o, c}. In addition, ⊗ represents element-wise multiplication (the Hadamard product), σ stands for the standard logistic sigmoid function, and φ denotes the tanh function: σ(z) = 1/(1 + e^{−z}) and φ(z) = (e^{z} − e^{−z})/(e^{z} + e^{−z}). The input, forget, and output gates are denoted by i_t, f_t, and o_t, respectively, while c_t represents the internal state of the memory cell at time t. The value of the hidden layer of the LSTM at time t is the vector h_t, while h_{t−1} holds the values output by each memory cell in the hidden layer at the previous time step.
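As a concrete sketch, one step of the update in (1) can be written in NumPy; the stacked weight layout, the gate ordering, and the function names here are our own illustrative choices, not part of the original formulation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step: input, forget, and output gates plus candidate update."""
    # W maps the concatenated [x_t; h_prev] to the four stacked gate pre-activations.
    z = W @ np.concatenate([x_t, h_prev]) + b
    H = h_prev.size
    i = sigmoid(z[0:H])          # input gate i_t
    f = sigmoid(z[H:2 * H])      # forget gate f_t
    o = sigmoid(z[2 * H:3 * H])  # output gate o_t
    g = np.tanh(z[3 * H:4 * H])  # candidate cell update
    c_t = f * c_prev + i * g     # new cell state (element-wise products)
    h_t = o * np.tanh(c_t)       # new hidden state
    return h_t, c_t
```

Packing the four gate blocks into one matrix of shape (4H, D + H) is a common implementation convenience; it computes exactly the five equations of (1) for one time step.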
The underlying mechanism behind LSTM model (used for building our LSTM ensemble) mainly comes from the "gate" components (shown in (1)) that are designed for learning when to let prediction error in, and when to let it out. As such, LSTM gates provide an effective mechanism in terms of quickly modifying the memory content of the cells and the internal dynamics in a cooperative way [20,21]. In this sense, the LSTM may have a superior ability to learn nonlinear statistical dependencies of real-world time series data in comparison to conventional forecasting models.
In the proposed ensemble forecasting method, each individual LSTM network is used as a "base (component) predictor model" [6]. Note that the ensemble forecasting approach is generally effective when there is a considerable degree of diversity among the individual base models, namely, the ensemble members [6,13,14]. In light of this fact, we vary an LSTM model parameter [15], namely, the sequence length, to increase diversity among the individual LSTM networks serving as ensemble members. For this, we train each LSTM network with a particular sequence length. The underlying idea of using different sequence length parameters is to inject diversity during the generation of the individual LSTM models. This approach can be beneficial for effectively modelling highly nonlinear statistical dependencies, since using multiple sequence length values allows for creating multiple LSTM models with various numbers of memory cells, which is likely to provide a complementary effect when learning the internal dynamics and characteristics of the time series data to be predicted. In this way, an ensemble of LSTM models with varying sequence lengths is capable of handling the dynamics and nonstationarities of real-world time series.
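To make this concrete, the per-member training sets can be generated by sliding a window of each member's sequence length over the series; a minimal sketch (function and variable names are our own):

```python
def make_windows(series, seq_len):
    """Split a 1-D series into (input window, next value) pairs for an
    LSTM trained with the given sequence length."""
    X, y = [], []
    for t in range(len(series) - seq_len):
        X.append(series[t:t + seq_len])
        y.append(series[t + seq_len])
    return X, y

# One training set per ensemble member, e.g. sequence lengths 1..10,
# so each member sees the series through a differently sized window.
member_data = {L: make_windows(list(range(100)), L) for L in range(1, 11)}
```

Each member is then trained on its own windowed view of the same series, which is the source of diversity described above.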
Assuming that a total of M LSTM models in an ensemble are provided, their ensemble forecast for a time series (y(1), y(2), ..., y(N)) with N observations is given by

  ŷ(t) = Σ_{i=1}^{M} w_i ŷ_i(t)          (2)

where ŷ_i(t) denotes the forecast output (at the t-th time step) obtained using the i-th LSTM model and w_i is the associated combining weight. In (2), each weight is assigned to the corresponding LSTM model's forecast output. Note that 0 ≤ w_i ≤ 1 and Σ_{i=1}^{M} w_i = 1.
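The convex combination in (2) is then a single weighted sum; a minimal sketch, with names of our own choosing:

```python
def combine_forecasts(forecasts, weights):
    """Composite prediction as in (2): a convex combination of the
    member forecasts, with nonnegative weights summing to one."""
    assert all(w >= 0.0 for w in weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * f for w, f in zip(weights, forecasts))
```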

Finding Adaptive Weights for Combining LSTM Models
An important factor that affects the forecasting performance of our LSTM ensemble is the set of combining weights w_i, i = 1, ..., M. We develop a novel weight determination scheme that captures the time-varying dynamics of the underlying time series in an adaptive manner.
We now describe the details of our weight determination solution, which is based on the following property: if the regression errors of individual estimators are uncorrelated and zero mean, their weighted average (shown in (2)) has minimum variance when the weights are inversely proportional to the variances of the estimators [22]. This property provides the theoretical foundation for our weight determination solution; for the proof, please refer to the Appendix. The combining weights are computed in the following recursive way:

  w_i(t) = (1 − γ) w_i(t − 1) + γ Δw_i(t)          (3)

where we suggest γ = 0.3. Note that Δw_i(t) is computed based on the inverse prediction error of the respective LSTM base model as follows:

  Δw_i(t) = E_i(t)^{−1} / Σ_{j=1}^{M} E_j(t)^{−1}          (4)

The term E_i(t) is related to the past prediction errors measured up to the t-th time step in the following way:

  E_i(t) = Σ_{v=1}^{V} λ^{v−1} e_i(t − v + 1)          (5)

where 0 < λ ≤ 1, 1 ≤ V ≤ t, and e_i(t) is the prediction error of the i-th LSTM model at time step t. In (5), E_i(t) is calculated over a sliding time window formed by the last V prediction errors. Herein, the forgetting factor λ is devised to reduce the influence of old prediction errors. By performing the weight update in time as described in (3), we can find the weights by analyzing intrinsic patterns in successive forecasting trials; this is beneficial for coping with nonstationary properties in the time series and avoids complex optimization for finding the adaptive weights.

Table 1: Time series datasets [18] used in our experimentation.

  Time series | Type                       | Total size | Training size | Validation size | Testing size
  River flow  | Stationary, nonseasonal    | 600        | 360           | 90              | 150
  Vehicles    | Nonstationary, nonseasonal | 252        | 151           | 38              | 63
  Wine        | Monthly seasonal           | 187        | 112           | 29              | 46
  Airline     | Monthly seasonal           | 144        | 86            | 22              | 36
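The recursive weight update described above can be sketched in a few lines; the exact discounting and normalization details below are our reading of the equations, so treat this as an illustration rather than the authors' reference code:

```python
def window_error(errors, lam=0.85, V=4):
    """Discounted sum of the last V prediction errors (cf. Eq. (5));
    the small epsilon guards against division by zero."""
    recent = errors[-V:][::-1]                 # most recent error first
    return sum(lam ** v * abs(e) for v, e in enumerate(recent)) + 1e-12

def update_weights(weights, error_histories, gamma=0.3, lam=0.85, V=4):
    """Recursive weight update (cf. Eqs. (3)-(4)): blend the previous
    weights with normalized inverse window errors."""
    inv = [1.0 / window_error(h, lam, V) for h in error_histories]
    delta = [v / sum(inv) for v in inv]        # normalized inverse errors
    new = [(1 - gamma) * w + gamma * d for w, d in zip(weights, delta)]
    s = sum(new)
    return [w / s for w in new]                # renormalize to sum to one
```

Each LSTM model keeps a running list of its one-step prediction errors; after every time step, `update_weights` produces the weights used in the next combination, so recently accurate models gain weight.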

Experimental Setup and Conditions.
The proposed LSTM ensemble method was implemented using TensorFlow [23] and trained with stochastic gradient descent (SGD). A total of ten LSTM base models (i.e., ensemble members) were used to build our LSTM ensemble. For each LSTM network, we used one hidden layer with the same number of memory cells as the assigned sequence length value (e.g., an LSTM with sequence length 5 has five memory cells). We set the λ and V parameters [shown in (5)] to 0.85 and 4, respectively. Also, the initial value of w_i(t) in (3) was set to zero. We tested our proposed LSTM ensemble on four discrete time series datasets (please refer to Table 1), namely, "River flow", "Vehicles", "Wine", and "Airline", representing real-world phenomena, which are publicly available at the Time Series Data Library (TSDL) [18]. For each time series, we used around 60% as the training dataset, the successive 15% as the validation dataset, and the remaining 25% as the test dataset. Note that the validation dataset was used for finding the combining weights, and all the results reported here were obtained on the test dataset. In our experimental study, we used two error measures to evaluate forecasting accuracy, namely, the mean absolute error (MAE) and the mean square error (MSE) [1,6]:

  MAE = (1/N) Σ_{t=1}^{N} |y(t) − ŷ(t)|,  MSE = (1/N) Σ_{t=1}^{N} (y(t) − ŷ(t))²          (6)

where y(t) and ŷ(t) are the target and forecast values (outcomes), respectively, of a time series with a total of N observations. Both the MAE and MSE performances of the forecasting algorithms considered were computed based on one-step-ahead forecasts, as suggested in [1,6,13]. Table 2 compares the forecasting performance (in terms of both MAE and MSE) of our method against the case of using only a single individual LSTM base model. We can see that our LSTM ensemble greatly outperforms all the individual LSTM base models in the ensemble.
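The two error measures in (6) are straightforward to compute; a minimal sketch:

```python
def mae(y, y_hat):
    """Mean absolute error over N observations."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def mse(y, y_hat):
    """Mean square error over N observations."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)
```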
To support this, compared to the best individual LSTM model, our ensemble method reduces the MSE prediction error by up to about 22%, 29%, 46%, and 21% for the "River flow", "Vehicles", "Wine", and "Airline" time series, respectively. Likewise, in terms of MAE, prediction errors are reduced by as much as 16%, 16%, 14%, and 13% for the same time series. Figure 2 depicts the actual and predicted time series obtained using our LSTM ensemble forecasting method. In each plot, the closeness between the actual observations and their forecasts is clearly evident. The results shown in Table 2 and Figure 2 confirm the advantage of our proposed LSTM ensemble method in notably improving forecasting accuracy compared with using a single individual LSTM network.

Comparison with Other State-of-the-Art Ensemble Forecasting Methods.

We compared our proposed method with other state-of-the-art ensemble forecasting algorithms, including simple averaging (Avg.) [3][4][5], Median [15], Least Square Regression (LSR) [2], Average of In-sample Weights (AIW) [16], and Neural Network Based Linear Ensemble (NNLE) [6]. Table 3 presents the comparative results. Note that all the results used for comparison were cited directly from the corresponding recently published papers (for details, please refer to [6]). In Table 3, the proposed LSTM ensemble achieves the lowest prediction errors, namely, the highest prediction accuracy, for both the MAE and MSE measures. In particular, Table 3 shows that our LSTM ensemble approach achieves the best performance on the challenging "Airline" and "Vehicles" time series, each of which is composed of nonstationary and quite irregular patterns (movements).
To further guarantee a stable and robust comparison with other ensemble forecasting methods, we calculate their so-called worth values [1,6]. The worth value is computed as the average percentage reduction in forecasting error relative to the worst of all ensemble forecasting methods under consideration, over all four time series datasets used. This measure shows the extent to which an ensemble forecasting method performs better than the worst ensemble and hence represents the overall "goodness" of the ensemble forecasting method. Let error_{i,j} denote the forecasting error (calculated via MAE or MSE) obtained by the i-th ensemble method on the j-th time series, and let max error_j denote the maximum (worst) error obtained by any method on the j-th time series. Then, the worth value is computed as follows:

  worth_i = (1/T) Σ_{j=1}^{T} [(max error_j − error_{i,j}) / max error_j] × 100 (%),  i = 1, ..., K          (9)

where T and K are the total numbers of time series datasets and ensemble forecasting methods, respectively, used in our experiments. The worth values computed from (9), together with the results in Tables 3 and 4, validate the feasibility of our proposed LSTM ensemble method with regard to improving state-of-the-art forecasting performance.
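The worth measure can be sketched as follows, with `errors[i][j]` holding the error of ensemble method i on time series j (layout and names are our own):

```python
def worth(errors):
    """Average percentage error reduction of each method relative to the
    worst method on each time series (cf. Eq. (9)).
    errors[i][j]: error of ensemble method i on time series j."""
    T = len(errors[0])                          # number of time series
    worst = [max(col) for col in zip(*errors)]  # worst error per series
    return [100.0 * sum((worst[j] - row[j]) / worst[j] for j in range(T)) / T
            for row in errors]
```

By construction, the worst-performing method has worth 0, and a method that halves the worst error on every series has worth 50.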

Effect of LSTM Ensemble Size on Forecasting Accuracy.
In the proposed method, the size of the LSTM network ensemble (i.e., the number of LSTM models within the ensemble) is one of the important factors determining forecasting accuracy. In this subsection, we investigate the impact of the LSTM ensemble size on forecasting accuracy. Figures 3(a) and 3(b) show the training and testing forecasting accuracy (in terms of MAE and MSE, respectively) with respect to different numbers of ensemble members for the "Wine" dataset. We can see that the training forecasting accuracy for both MAE and MSE continues to improve as the size of the LSTM ensemble grows and then quickly levels off. The testing forecasting accuracy generally improves as the ensemble size increases up to a particular number, then fluctuates, and finally converges to nearly a constant value. It can also be observed in Figure 3 that the testing forecasting accuracy for both measures is best when the number of ensemble members is around 10. These observations indicate that a larger LSTM ensemble size does not always guarantee improved (generalization [24]) forecasting accuracy. Moreover, a smaller number of ensemble members is beneficial for reducing the required computational cost, especially when our proposed LSTM ensemble is applied to forecast (predict) a large number of time series, which is common in energy- and finance-related forecasting applications [7][8][9].

Runtime Performance.

Furthermore, we assess the runtime performance of our LSTM ensemble forecasting framework using the "River flow" dataset. Our hardware configuration comprises a 3.3-GHz CPU, 64 GB of RAM, and an NVIDIA Titan X GPU. We found that the total time needed to train a single LSTM model is about 2.9 minutes, while the training time required to build our overall LSTM ensemble framework (for the case of ten ensemble members) is about 28.3 minutes. However, it should be pointed out that the average time required to produce a forecast across all time points (steps) is as low as 0.003 seconds. It should also be noted that the time required to construct our LSTM ensemble framework need not be counted toward the execution time, because this process can be executed offline in real-life forecasting applications. In light of this fact, the proposed LSTM ensemble method is feasible for practical engineering applications, considering its balance of forecasting accuracy, low testing time, and straightforward implementation.

Conclusions
We have proposed a novel LSTM ensemble forecasting method and have shown that it is capable of capturing the dynamic behavior of real-world time series well. Comparative experiments on four challenging time series indicate that the proposed method achieves superior performance compared with other popular forecasting algorithms. This is achieved by a novel scheme that adjusts the combining weights based on time-dependent reasoning and self-adjustment. It is also shown that our LSTM ensemble can effectively model highly nonlinear statistical dependencies, since its gating mechanisms enable quick modification of the memory content of the cells and the internal dynamics in a cooperative way [20,21]. In addition, our runtime analysis demonstrates that our LSTM ensemble achieves a runtime competitive with using only a single LSTM network. Consequently, our proposed LSTM ensemble forecasting solution can be readily applied to time series forecasting (prediction) problems, in terms of both forecasting accuracy and fast runtime performance.

Appendix
The weighted averaging defined in (2) combines the outputs of the M base estimators as

  ŷ(t) = Σ_{i=1}^{M} w_i ŷ_i(t)          (A.1)

where the weights satisfy the constraints w_i ≥ 0 and Σ_{i=1}^{M} w_i = 1. Let e_i(t) = y(t) − ŷ_i(t) denote the error of the i-th estimator and e(t) = (e_1(t), ..., e_M(t))^T. The mean square error (MSE) of the combined output ŷ(t) with respect to the target value y(t) can then be written as follows [22]:

  MSE = E[(y(t) − ŷ(t))²] = w^T C w          (A.2)

where C stands for the symmetric covariance matrix defined by C = E[e(t) e(t)^T]. Our goal is to find the optimal weights that minimize the aforementioned MSE, which can be solved by applying a Lagrange multiplier as follows [22]:

  L(w, μ) = w^T C w − μ (1^T w − 1)          (A.3)

where 1 denotes the all-ones vector. Setting the gradient of (A.3) with respect to w to zero yields w = (μ/2) C^{−1} 1. By imposing the constraint Σ_{i=1}^{M} w_i = 1, we find that

  w = C^{−1} 1 / (1^T C^{−1} 1)          (A.4)

Under the assumption that the errors e_i(t) are uncorrelated and zero mean, i.e., C_{ij} = 0 for all i ≠ j with C_{ii} = σ_i², the optimal weights can be obtained [22] as

  w_i = σ_i^{−2} / Σ_{j=1}^{M} σ_j^{−2}          (A.5)

The result in (A.5) shows that the weights should be inversely proportional to the variances (i.e., errors) of the respective estimators. In other words, the weights should be proportional to the prediction (regression) performance of the individual estimators.
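The optimality of the inverse-variance weights in (A.5) can be checked numerically: for unbiased, uncorrelated estimators, the inverse-variance combination should yield a lower mean squared error than equal weighting. A small Monte Carlo sketch (the chosen variances and all names are illustrative):

```python
import random

random.seed(0)
N = 50_000
sig = [0.5, 1.0, 2.0]                      # std devs of three unbiased estimators
inv_var = [1.0 / s ** 2 for s in sig]
w = [v / sum(inv_var) for v in inv_var]    # inverse-variance weights, as in (A.5)
eq = [1.0 / 3.0] * 3                       # equal weights (simple averaging)

def avg_sq_error(weights):
    """Monte Carlo estimate of the MSE of the weighted combination."""
    total = 0.0
    for _ in range(N):
        errs = [random.gauss(0.0, s) for s in sig]
        total += sum(wi * e for wi, e in zip(weights, errs)) ** 2
    return total / N

# Theoretical variances: 1 / sum(1/sigma^2) vs. sum(sigma^2) / 9.
assert avg_sq_error(w) < avg_sq_error(eq)
```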

Data Availability
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.