Application of residual self-fitting ensemble neural network based on LSTM in short-term power load forecasting

Accurate load forecasting helps to improve the safety and stability of power systems. LSTM neural networks based on deep learning have lower prediction error than traditional machine learning and time series models. After calculating the residuals between the predicted and observed values of the LSTM and of a traditional machine learning model, we find that the residuals also exhibit the regularity of time series data. Therefore, the residual time series can be fitted by another model based on this regularity. To avoid the limitations of a single model, several neural network models are combined into an ensemble model. In this model, the residuals are generated and fitted by the model itself; in other words, it implements self-fitting. Based on an analysis of the residual distribution and of ensemble methods for predictive models, this paper presents a new ensemble neural network model for short-term power load forecasting. Experiments show that the predictions of the new ensemble neural network are more accurate than those of a single LSTM.


Introduction
Power system load forecasting refers to forecasting the power system load at a future time based on historical data. Load forecasting is an important part of power system planning. According to the forecast horizon, power load forecasting methods can be divided into long-term, medium-term, short-term and ultra-short-term forecasts. Short-term load forecasting usually covers horizons within 1 year, down to hours, and is mainly used to regulate and guide the daily operation of the power sector.
There are two types of commonly used short-term load forecasting models. The first type comprises traditional time series and machine learning models. The typical time series model is the autoregressive integrated moving average model, ARIMA [1]. Typical machine learning models are the BPNN artificial neural network [2], the XGBoost ensemble tree model [3] and the SVM support vector regression model [4]. Among these models, BPNN is the most widely used because of its excellent linear and nonlinear fitting ability. However, a common problem of these models is that they do not consider the temporal correlation of time series data, so time characteristics have to be added to the input to help the prediction. The second type comprises deep learning models such as RNN [5] and LSTM [6] neural networks, which have been widely used as sequence models in recent years. RNN and LSTM [7][8] can effectively use the historical data of a time series and find its key information automatically. Their complex neural network structure has a strong linear and nonlinear fitting ability, and their parameters are adjusted automatically through the back propagation through time (BPTT) method. The disadvantage of RNN is that, when the time interval increases, the gradient may "vanish" [9]: the gradient stays close to zero and the parameters can no longer be adjusted. LSTM improves on the structural design of RNN [10][11][12], which effectively avoids this problem and makes the model more powerful. Therefore, the application scenarios of LSTM are more extensive.

Structure of LSTM cell
The cell unit structure of LSTM [13] can be written in its standard form as:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ is the sigmoid function, $i$ is the input gate, $f$ is the forget gate, $o$ is the output gate, $c$ is the cell activation vector, $h$ is the hidden vector, $b$ denotes a bias vector and $\odot$ denotes element-wise multiplication. The weight matrix subscripts have the obvious meaning; for example, $W_{hi}$ is the hidden-input gate matrix and $W_{xo}$ is the input-output gate matrix. The output of the activation function $\sigma$ lies in [0, 1]. If the output of $\sigma$ is 0, all the information of the previous state is discarded. If it is 1, all the information of the previous state is retained.

[...] moment values are added to the data as time characteristics, and they determine the number of neural units in the input layer. The final output is a single predicted value, so the output layer consists of 1 neural unit. The hidden layer consists of 50 neural units, and the activation function is ReLU. The number of training iterations is 500.
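The gate mechanism described in this section can be made concrete with a small sketch. The code below is a minimal pure-Python illustration of one LSTM step in the common form without peephole connections, which is an assumption here rather than the exact cell of [13]. The helper names (`lstm_step`, `zero_params`) and all weight values are hypothetical and chosen only to show how a forget gate near 1 retains the previous cell state while a forget gate near 0 discards it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; p maps names such as "W_xi", "W_hi", "b_i" to weights."""
    def pre(gate):
        # Pre-activation W_x* x + W_h* h_prev + b_* for the given gate
        wx = matvec(p["W_x" + gate], x)
        wh = matvec(p["W_h" + gate], h_prev)
        return [a + b + c for a, b, c in zip(wx, wh, p["b_" + gate])]

    i = [sigmoid(z) for z in pre("i")]      # input gate
    f = [sigmoid(z) for z in pre("f")]      # forget gate
    o = [sigmoid(z) for z in pre("o")]      # output gate
    g = [math.tanh(z) for z in pre("c")]    # candidate cell input
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

def zero_params(n_in, n_hid):
    """Hypothetical helper: all-zero weights and biases for every gate."""
    p = {}
    for gate in "ifoc":
        p["W_x" + gate] = [[0.0] * n_in for _ in range(n_hid)]
        p["W_h" + gate] = [[0.0] * n_hid for _ in range(n_hid)]
        p["b_" + gate] = [0.0] * n_hid
    return p

params = zero_params(n_in=1, n_hid=2)
# Unit 0: forget gate close to 1 (state kept); unit 1: close to 0 (state dropped)
params["b_f"] = [50.0, -50.0]
h, c = lstm_step([0.3], [0.0, 0.0], [1.0, 1.0], params)
print(c)
```

With these illustrative weights the first cell keeps its previous value of 1 almost unchanged, while the second is driven almost to 0, which matches the description of the sigmoid gate output at its two extremes.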

Predicted results of LSTM and BPNN
The LSTM consists of 2 hidden layers, 1 input layer and 1 output layer. There is 1 neural unit in the input layer. Because the final output is a single predicted value, the output layer consists of 1 neural unit. The two hidden layers consist of 100 neural units, the activation function is ReLU and the optimization function is RMSProp. The number of training iterations is 1500. The MAPE of LSTM is 5.19% and the MAPE of BPNN is 6.15%, so the LSTM predictions are more accurate than those of BPNN.
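For reference, the MAPE figures quoted here can be computed as in the following sketch. It assumes the usual definition of the mean absolute percentage error (the mean of |observed - predicted| / |observed|, expressed in percent); the function name `mape` and the load values are hypothetical and are not data from the experiment.

```python
def mape(observed, predicted):
    """Mean absolute percentage error in percent (observed values must be nonzero)."""
    if len(observed) != len(predicted):
        raise ValueError("series must have the same length")
    total = sum(abs((o - p) / o) for o, p in zip(observed, predicted))
    return 100.0 * total / len(observed)

# Hypothetical load values: relative errors of 10%, 5% and 0%
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
score = mape(observed, predicted)
print(score)  # about 5.0
```

Because each error is divided by the observed value, MAPE is only meaningful when the load never reaches zero, which is normally the case for system-level load data.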

Analysis of Residuals
After predicting on the test data set, we obtain the predicted results, from which the residuals between the predicted values and the observed values can be calculated.
Based on the analysis method of time series data, the time distribution of residuals can be analyzed in terms of stationarity and correlation.
The stationarity of a time series is a prerequisite for time series analysis. To test the stationarity of the residual time series, the ADF unit root test [14] is used. As shown in Table 1, for the residuals of both LSTM and BPNN, the p value is less than 0.05 and the ADF test statistic is less than the 1%, 5% and 10% critical values. Both residual time series are therefore stationary.
The correlation of the residuals over time can be examined through the autocorrelation function (ACF) and the partial autocorrelation function (PACF) [16][17] of the LSTM residuals and the BPNN residuals. These tests show that the residuals of BPNN and LSTM are not randomly distributed: they exhibit clear time series characteristics. The residuals can therefore be treated as another time series to be predicted.
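The residual and autocorrelation calculations in this section can be sketched as follows. This is a minimal pure-Python illustration, assuming the sign convention residual = observed - predicted and the standard sample autocorrelation estimator; the function names and the synthetic daily-cycle data are hypothetical and only stand in for the real load series.

```python
import math

def residuals(observed, predicted):
    """Residual series, assuming the convention observed minus predicted."""
    return [o - p for o, p in zip(observed, predicted)]

def acf(series, max_lag):
    """Sample autocorrelation coefficients for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]

# Hypothetical data: hourly load with a daily (24-step) cycle and a flat
# forecast, so the whole cycle ends up in the residuals
observed = [500.0 + 40.0 * math.sin(2 * math.pi * t / 24) for t in range(240)]
predicted = [500.0] * 240

res = residuals(observed, predicted)
r = acf(res, max_lag=24)
# Strongly negative at half a day, strongly positive at one full day
print(round(r[12], 2), round(r[24], 2))
```

Coefficients that stay far from zero at particular lags indicate that the residuals still carry time structure that a second model can learn. In practice, the `adfuller`, `acf` and `pacf` functions in `statsmodels.tsa.stattools` could be used for the tests named in this section.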

Ensemble Neural Networks
The ensemble neural network is built on the Bagging and Stacking methods of ensemble learning. The Bagging method trains multiple learning models in parallel and averages their results. The Stacking method also trains multiple models. The first model, called the Base layer model, is trained on the original training data. The second model, called the Meta layer model, is trained on the results of the first model. By integrating multiple learning models with the Bagging and Stacking methods, the ensemble learning model becomes a stronger learner.
The structure of the ensemble neural network is shown in Figure 4. First, the base layer neural network is used to fit the data, and the residuals between its results and the observed values are calculated. Second, the meta layer neural network is used to fit these residuals. Finally, the results of the meta layer are added to the output of the base layer to give the final result. In this process, the self-fitting of the residuals in the ensemble model is realized. After the results of the two stacking models are obtained, their geometric average and arithmetic average are calculated, and the better averaging method is chosen. In these two graphs, compared with the real residuals, the predicted residuals lie closer to the horizontal axis, and their absolute values are smaller than those of the real residuals. The predicted residuals are used to compensate for the error between the predicted values and the observed values, so they should not be too large, in order to avoid over-compensation.
Ensemble model (arithmetic average): MAPE 3.87%
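The stacking and averaging procedure described above can be sketched with deliberately simple stand-in models. In the code below, two toy base forecasters (persistence and a three-point moving average) replace the LSTM and BPNN base layers, and a least-squares AR(1) fit replaces the meta layer LSTM; all of these substitutions, the function names and the synthetic load series are assumptions made for illustration, not the networks used in the experiment.

```python
import math

def fit_ar1(series):
    """Least-squares AR(1) fit r[t] ~ a * r[t-1] + b; returns (a, b)."""
    x, y = series[:-1], series[1:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx if sxx > 0 else 0.0
    return a, my - a * mx

def stacked_forecast(base_predict, history, warmup=3):
    """Return (base forecast, base forecast + predicted residual)."""
    # Base layer: one-step-ahead residuals (observed minus predicted) on history
    resid = [history[t] - base_predict(history[:t])
             for t in range(warmup, len(history))]
    # Meta layer: fit the residual series and predict the next residual
    a, b = fit_ar1(resid)
    base = base_predict(history)
    return base, base + (a * resid[-1] + b)

def persistence(h):
    """Toy base model: repeat the last observed value."""
    return h[-1]

def mean_last3(h):
    """Toy base model: average of the last three observed values."""
    return sum(h[-3:]) / 3

# Hypothetical load series with a steady upward trend; the next value is 140
history = [100.0 + 2.0 * t for t in range(20)]

base1, stack1 = stacked_forecast(persistence, history)
base2, stack2 = stacked_forecast(mean_last3, history)

# Combine the two stacked models with both averaging methods
arithmetic = (stack1 + stack2) / 2
geometric = math.sqrt(stack1 * stack2)  # valid for positive loads
print(base1, base2, stack1, stack2, arithmetic, geometric)
```

On this toy series both base models lag behind the trend by a constant amount, so their residuals are perfectly regular and the meta layer removes the error entirely. Real load residuals are far less regular, which is why the choice between arithmetic and geometric averaging is made empirically.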

Conclusion
The LSTM neural network performs better than traditional models in power system load forecasting. To improve on the LSTM network, LSTM+LSTM and BPNN+LSTM stacking models are designed with the ensemble method. The residuals predicted by the Meta layer LSTM, which is fitted on the historical residual time series, compensate for the error between the Base layer predicted values and the observed values. More accurate results are obtained by averaging the two stacked models using the Bagging method. The MAPE of the ensemble neural network is 3.87%, which is 1.32 percentage points lower than the MAPE of the single LSTM (5.19%).