Two-stage short-term wind power forecasting algorithm using different feature-learning models

Two-stage ensemble-based forecasting methods have been studied extensively in the wind power forecasting field. However, deep learning-based wind power forecasting studies have not investigated two aspects. In the first stage, different learning structures considering multiple inputs and multiple outputs have not been discussed. In the second stage, the model extrapolation issue has not been investigated. Therefore, we develop four deep neural networks for the first stage to learn data features considering the input-and-output structure. We then explore the model extrapolation issue in the second stage using different modeling methods. Considering the overfitting issue, we propose a new moving window-based algorithm using a validation set in the first stage to update the training data in both stages with two different moving window processes. Experiments were conducted at three wind farms, and the results demonstrate that the model with a single-input multiple-output structure obtains better forecasting accuracy than existing models. In addition, the ridge regression method yields a better ensemble model that can further improve forecasting accuracy compared to existing machine learning methods. Finally, the proposed two-stage forecasting algorithm can generate more accurate and stable results than existing algorithms.


Overview
Renewable energy has become a primary focus in academic research and has driven changes in the power industry. By the end of 2020, the world's installed renewable energy capacity reached 2799 GW [1]. Among various renewable energy sources, wind energy is a very promising source [2] that takes up the largest proportion [3].
However, the intermittent and dynamic nature of wind energy introduces risks to the economical and reliable operation of power systems [4].
As reported by Muesgens and Neuhoff [5], wind forecasts issued 1-4 h ahead of physical energy dispatch are an effective method to reduce the balancing costs caused by wind uncertainties. Thus, numerous studies [6][7][8] have focused on developing forecasting models using different techniques to improve short-term wind power forecasting (WPF) accuracy. Generally, recent forecasting models can be classified as single-model and ensemble-model approaches. With single-model approaches, given training data, a statistical modeling or machine learning method is employed to construct a single model to forecast wind power generation [9][10][11]. In ensemble-model approaches, the forecasting model is developed using an ensemble method to integrate the initial forecasting results from several individual models [13][14][15]. Most existing ensemble-based studies have used a two-stage framework. For example, Feng et al. [13] exploited different machine learning (ML) techniques to build multiple single forecasting models in the first stage, and then they integrated the results with an algorithm to improve forecasting accuracy. Hao and Tian [15] proposed a two-stage WPF module with a nonlinear ensemble method in the second stage to integrate all components and forecast error values. Ensemble models are more comprehensive; thus, Freedman et al. [12] stated that ensemble models may produce more accurate forecasting results than a single model. However, some factors have not been considered in the ensemble-based literature. Thus, in this study, we develop a WPF model within the two-stage framework in consideration of the following aspects.
The first factor we consider is the model input in the first stage. Using hybrid data (e.g., historical wind data and numerical weather prediction (NWP) data) as input has been studied extensively in the literature [10][11] because such data can provide more input information. Nonetheless, historical wind data differ from NWP data, which are generated from weather research forecasting simulation models, in terms of the variance and mean (Section 2). Typically, previous studies have combined historical wind data and NWP data as input to the feature-learning process, which is referred to as single-input learning in this paper. Compared to single-input learning, multiple-input learning uses different channels or methods to learn features to avoid interference between different data types. Therefore, our first goal is to determine whether single-input learning or multiple-input learning is more suitable for the WPF model.
The second factor considered in this paper is the model output structure in the first stage. Typically, previous studies [9][10][11][12][13][14][15] have considered the single-output learning structure; however, for deep neural networks, such a learning structure sometimes prevents the gradient from propagating effectively to the lower layers. Multiple-output learning has been studied to address this issue [16], and it has been demonstrated that this approach [17] can improve the accuracy of low-level features such that the overfitting issue might be avoided. However, to the best of our knowledge, no previous study that used wind speed and wind power as input [22][23] has considered the multiple-output learning structure when constructing forecasting models. Note that if the input data are of a single type, e.g., historical wind power, the multiple-output learning structure is not suitable. Therefore, considering the application of hybrid data as input, our second goal is to determine whether the multiple-output learning structure can be used to improve forecasting accuracy.
The third factor is the modeling technique employed in the second stage, which blends the forecasting results from the first stage. Dorado-Moreno et al. [21] stated that wind power ramp event forecasting should not depend on long-term historical data. Therefore, we prefer to use the latest data just before the forecasting day as the input to the second-stage ensemble model; however, the data size should not be large. With this method, it is possible to improve forecasting accuracy; however, the small amount of training data in the second stage may result in model extrapolation. In statistics, model extrapolation is a well-known issue that may occur when evaluating a model with data beyond the range of the original training data [20]. In ensemble-based research, although ML methods, e.g., artificial neural networks (ANN) [15] and support vector regression (SVR) [18], have achieved some success in integrating the results from the first stage, these studies did not consider the model extrapolation issue. Different ML methods have different sensitivity to model extrapolation [28]. When model extrapolation occurs, in some cases, the model retains the shape it has at the data boundary; in other cases, the original model shape is ruined at the point where extrapolation occurs. Thus, our third goal is to investigate which modeling method is more suitable for the second stage.
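The differing sensitivity of modeling methods to extrapolation can be illustrated with a minimal sketch. The data, model settings, and query point below are purely illustrative (not from the paper): a linear model such as ridge regression extends the fitted trend beyond the training range, whereas an RBF-kernel SVR prediction decays back toward the training-data range once the query point is far from all training points.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import SVR

# Synthetic training data: a simple increasing trend on a narrow range [0, 1].
rng = np.random.default_rng(0)
X_train = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
y_train = 2.0 * X_train.ravel() + rng.normal(0.0, 0.01, 50)

ridge = Ridge(alpha=1e-3).fit(X_train, y_train)
svr = SVR(kernel="rbf", C=10.0).fit(X_train, y_train)

# Query a point well outside the training range.
x_out = np.array([[3.0]])
ridge_pred = ridge.predict(x_out)  # continues the linear trend (~6)
svr_pred = svr.predict(x_out)      # RBF kernel decays; falls back toward the training-data range
```

This is the behavior the paper exploits: when second-stage test inputs exceed the small second-stage training range, a method whose shape is preserved at the boundary is preferable.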

Contributions
The two-stage forecasting framework is an effective way to improve WPF accuracy. However, several factors have not been addressed within this framework. To bridge the knowledge gap discussed above, for the first stage, we developed four single forecasting models with different learning structures: single-input-single-output (SISO), multiple-input-single-output (MISO), single-input-multiple-output (SIMO), and multiple-input-multiple-output (MIMO) models. The data used in this study include the historical wind speed, wind power, and NWP data. Even though our goal was to use these four models to effectively learn the data features from different aspects to provide useful data for the second-stage ensemble model, we also compared the performance of these single forecasting models on a testing dataset. For the second stage, we integrated the results from the first stage using the ridge regression (RR) method to reduce model extrapolation errors [19]. We also exploited three popular machine learning (ML) techniques, i.e., the ANN, SVR, and Gaussian process regression (GPR) techniques, as benchmarks to demonstrate the performance of the RR method in the second stage. In addition, considering the uncertainties and intermittencies of natural wind, we incorporated the two-stage forecasting framework into a moving window-based training data updating algorithm. Unlike the moving window algorithms used in the literature [25][26], the proposed algorithm uses a validation set in the first stage to adjust the model's parameters. In addition, this algorithm involves two different moving window processes to dynamically update the first-stage and second-stage training data. Finally, we compared the proposed algorithm to several existing algorithms to demonstrate its advantages. Our primary contributions are summarized as follows.
1. We have developed a moving window-based two-stage short-term WPF algorithm.
2. We have developed four deep neural networks in consideration of multiple-input and multiple-output learning structures.
3. We have exploited the RR method to construct an ensemble model, for the first time, in consideration of the model extrapolation issue in the forecasting process.
4. We have demonstrated that the proposed algorithm performs better than existing algorithms in numerical experiments.

Organization
The remainder of this paper is organized as follows. Section 2 introduces the first-stage models, and Section 3 describes the moving window-based two-stage forecasting algorithm. Section 4 presents the results of the numerical studies. Finally, Section 5 concludes the paper.

Development of first-stage models
In the deep learning-based WPF model, the combination of convolutional neural networks (CNN) and long short-term memory (LSTM) networks has shown outstanding performance [9][10][11]. Therefore, we exploit such a combination to construct the SISO, SIMO, MISO, and MIMO models in the first stage.

SISO model
The SISO model (Fig. 1) follows a structure that has been used extensively in the literature, where the input data are combined into a single input to the forecasting model, and a single forecasting result is output.

SIMO model
Figure 2 shows the SIMO model architecture. Here, we assume that the multiple-output learning structure can potentially avoid the overfitting issue and thereby yield a more robust model. Compared to the SISO model, the SIMO model forecasts both wind speed and wind power (Fig. 2). Thus, the loss function can be expressed as a weighted sum of the two output losses, L = w_p L_power + w_s L_speed, where w_p and w_s weight the wind power and wind speed losses, respectively. The Pearson correlation coefficient between the real wind speed and wind power is greater than 0.933 in the three selected wind farms; thus, we set w_p to 1 and w_s to 0.9.
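The weighted two-output loss with weights 1 and 0.9 can be sketched as follows. This assumes a mean-squared-error loss per output, which the paper does not state explicitly; the function name and signature are illustrative.

```python
import numpy as np

def simo_loss(power_true, power_pred, speed_true, speed_pred,
              w_power=1.0, w_speed=0.9):
    """Weighted sum of per-output MSE losses. The weights follow the text:
    1.0 for the wind power output, 0.9 for the auxiliary wind speed output."""
    mse_power = np.mean((np.asarray(power_true) - np.asarray(power_pred)) ** 2)
    mse_speed = np.mean((np.asarray(speed_true) - np.asarray(speed_pred)) ** 2)
    return w_power * mse_power + w_speed * mse_speed

# Example: perfect power forecast, speed off by 1 everywhere -> loss = 0.9
loss = simo_loss([1.0, 2.0], [1.0, 2.0], [5.0, 6.0], [6.0, 7.0])
```

Down-weighting the auxiliary speed output keeps the network focused on the primary wind power target while still benefiting from the correlated speed signal.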

MISO model
A comparison of the NWP wind speed, real wind speed, and real wind power data is shown in Fig. 3. From such marginal plots, the differences among these three data types appear significant; however, based on the trend of each data type, the real wind power data follow a trend more similar to the real wind speed data than to the NWP wind speed data.
In addition, the real wind speed exhibits some patterns similar to the NWP wind speed. Thus, for the MISO model, we assume that the multiple-input structure can potentially learn features unaffected by other types of data to improve forecasting accuracy. Figure 4 shows the MISO model architecture. Here, the SISO structure is used to extract the features of the NWP data; however, for the historical wind speed and wind power data, we employ another LSTM layer to learn their internal patterns. The features extracted from the three different types of input data are then concatenated before being output.

MIMO model
The structure of the MIMO model integrates the MISO and SIMO models (Fig. 5). If the assumptions made for the MISO and SIMO models are true, such a comprehensive MIMO model is expected to further improve the forecasting accuracy. However, if only one of the aforementioned two assumptions is true, predicting the MIMO model performance will be difficult. Although the MIMO model structure is more complicated than the other three models, there is no guarantee that the MIMO model will produce the best forecasting results. All four models use the same number of CNN and LSTM layers and the same parameters in each layer. We determined the number of CNN and LSTM layers and the parameters in each layer based on our preliminary experiment, and this combination obtained better forecasting accuracy than using SVR and GPR. However, we did not focus on optimizing the best combination of CNN and LSTM layers because this was beyond the scope of the study. Thus, with these four structures, we can learn the data features from different perspectives and simultaneously investigate model performance by changing only the input-and-output structure.
Two-stage forecasting algorithm
The data used for each stage are critical to developing a two-stage forecasting algorithm. Considering the overfitting issue, we ensured that the training sets in the first and second stages did not overlap. In the first stage, we considered a validation set to adjust the parameters of the training model; however, we did not use the cross-validation method to obtain a generally good model. The wind data, especially the NWP data on consecutive days, demonstrate some similarities [24]; thus, under this design, a model that performs well on the validation set is likely to also perform well on the testing set. Therefore, we used the data immediately before the forecasting day as the validation set. In the second stage, the ensemble model was employed to further improve the forecasting accuracy; thus, we used the latest data before the forecasting day as the training set. Compared to the data used in the first stage, the second stage only required a small amount of data. Note that the second-stage training data may not necessarily be the same as the validation set because the wind sometimes changed unpredictably, and such a difference would help avoid overfitting. Even though these second-stage settings benefit forecasting accuracy, a sudden change in the testing wind data could fall beyond the range of the second-stage training data. As a result, model extrapolation may occur. If the second-stage model construction method is not well selected, the forecasting accuracy can be seriously affected by the model extrapolation issue. We set the window size of the first-stage training set to one year. In addition, the window sizes of the validation, second-stage training, and testing sets were set to 10 days.
For example, when forecasting the short-term wind power on January 11, 2020, we used data from 2019 as the first-stage data and data from January 1, 2020 to January 10, 2020 as the validation and second-stage training sets. When forecasting the wind power on January 20, 2020, we used the 2019 data as the first-stage data and data from January 1, 2020 to January 10, 2020 as the validation set. However, the second-stage training data were updated to include data from January 10, 2020 to January 19, 2020.
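The window arithmetic in this example can be sketched as follows. The function name and the choice to key the first-stage window to the previous calendar year are illustrative assumptions based on the example above; the validation set is fixed per Moving Window 1 step, while the second-stage window always covers the 10 days just before the forecasting day.

```python
from datetime import date, timedelta

def window_bounds(forecast_day, window_days=10):
    """Illustrative sketch of the two training windows described in the text.
    First-stage training: the previous calendar year.
    Second-stage training: the `window_days` days just before the forecast day."""
    first_stage = (date(forecast_day.year - 1, 1, 1),
                   date(forecast_day.year - 1, 12, 31))
    second_stage = (forecast_day - timedelta(days=window_days),
                    forecast_day - timedelta(days=1))
    return first_stage, second_stage

# Forecasting January 20, 2020: first stage uses 2019 data,
# second stage uses January 10-19, 2020, as in the example above.
fs, ss = window_bounds(date(2020, 1, 20))
```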
In addition to the moving window algorithms, the four proposed deep neural networks (Section 2) were applied to the first-stage data to adjust their parameters. In the second stage, the first-stage models (i.e., MIMO, MISO, SIMO, and SISO) were applied directly to the second-stage wind data to obtain their forecasting results (see Table 1 for the corresponding notation). This was a key step in the two-stage forecasting framework because it connected the first and second stages. As a result, it was possible to learn the forecasting errors of the first-stage models by comparing the four forecasts to the corresponding real power. Then, an ensemble model generated by the RR method was constructed with the four first-stage forecasts as input and the real power as output.
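The second-stage ensemble step can be sketched as follows. The data here are synthetic stand-ins (the true power series, the per-model biases and noise, and the ridge penalty are all illustrative assumptions): the four first-stage forecasts form the feature columns, and RR learns how to blend them against the real power.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
true_power = rng.uniform(0.0, 100.0, 240)  # hypothetical second-stage real power series

# Hypothetical first-stage forecasts: each model = truth + its own bias and noise,
# columns standing in for the SISO, MISO, SIMO, and MIMO outputs.
stage1 = np.column_stack([true_power + rng.normal(b, 5.0, 240)
                          for b in (2.0, -3.0, 1.0, 4.0)])

# Fit the RR ensemble on the second-stage training window.
ensemble = Ridge(alpha=1.0).fit(stage1, true_power)

# At test time, the four first-stage forecasts for a new period are blended:
test_inputs = np.array([[50.0, 45.0, 49.0, 53.0]])
blended = ensemble.predict(test_inputs)
```

Because RR is linear, the blend extends smoothly when test-time forecasts fall outside the small second-stage training range, which is the motivation given in the text.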
When forecasting the next-period wind power, the testing data were first input to the four well-trained first-stage models to obtain their forecasts, which were then input to the RR model (see the blue lines in Fig. 6).
Algorithm 1 shows the detailed information about this two-stage forecasting algorithm.

We implemented the proposed algorithm with data from three wind farms in China to evaluate the short-term WPF accuracy. Table 2 shows the basic information of these wind farms. The wind conditions of the three wind farms differed in terms of the mean and standard deviation of their wind speed and power. Therefore, the results of our case studies are likely to demonstrate the generalizability of the forecasting models.
The wind data for 2018 and 2019 were used in these case studies. Here, we randomly selected 10-day data from each season as a testing set to examine the performance of the models. Thus, when Moving Window 1 moved to a new step, the evaluation set for the first day of forecasting coincided with the second-stage training set. The 2-h-ahead wind power prediction is typically used as a benchmark to evaluate the accuracy of a forecasting model in the current power system. Thus, we implemented the proposed algorithm to forecast the 2-h-ahead wind power at the target wind farms and used the root mean square error (RMSE) and mean absolute error (MAE) to evaluate the performance of the forecasting results.
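The two evaluation measures are standard; a minimal implementation is:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between real and forecast wind power."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error between real and forecast wind power."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

# Example on a toy series with one forecast off by 1:
r = rmse([1.0, 2.0, 3.0], [2.0, 2.0, 3.0])  # sqrt(1/3) ~ 0.577
m = mae([1.0, 2.0, 3.0], [2.0, 2.0, 3.0])   # 1/3 ~ 0.333
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a view of average error and error spread.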

Experiment 1
In the first experiment, we investigated the effects of the four models on forecasting accuracy in consideration of the different model structures in the first stage. Here, Moving Window 2 (Fig. 6) was not performed in this experiment.
In other words, we only constructed the first-stage models and used them to forecast the 10-day wind power. Then, we obtained the results for the four seasons (Table 3) after applying the evaluation measures in Eqs. (3) and (4). The best forecasting accuracy in each row of Table 3 is highlighted in bold.
The results given in Table 3 demonstrate that the accuracies of the proposed models differed across seasons, which indicates that the different input-and-output learning structures generated different forecasting results. In addition, we found that the SIMO model demonstrated the best forecasting accuracy in most cases. The average performance of the four models is shown in Fig. 7. The MIMO model demonstrated a slightly better performance than the MISO model in these three cases, indicating that the multiple-output structure was effective in terms of improving forecasting accuracy. However, when comparing MISO to SISO, we cannot conclude that the multiple-input structure improves forecasting accuracy. Thus, these results indicate that mixing the different input data together as a single input was better when learning the data features, and that adding an auxiliary output helped improve forecasting accuracy. Fig. 8 shows the variances of the absolute difference between the predicted and real wind power obtained by the four proposed models. We found that the differences were not quite the same in each case; however, the variances of the SIMO model were smaller than those of the other models. To further examine the conclusions drawn from Figs. 7 and 8, we combined all forecasting accuracies for the four seasons in Table 3 and performed a statistical analysis using the paired t-test. As shown in Table 4, all p-values were less than 0.05, and all confidence intervals (CI) were less than zero when comparing the forecasting accuracies of the SIMO model to those of the other models. In other words, the SIMO model generated statistically better forecasting accuracies than the other models.
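The paired t-test used here compares the two models' accuracies case by case. A minimal sketch with hypothetical per-case RMSE values (not the paper's actual numbers):

```python
import numpy as np
from scipy import stats

# Hypothetical per-case RMSE values for two models (lower is better).
rmse_simo = np.array([4.1, 3.8, 4.5, 4.0, 3.9, 4.2])
rmse_siso = np.array([4.6, 4.1, 4.9, 4.4, 4.3, 4.8])

# Paired t-test on the per-case differences (SIMO minus SISO).
t_stat, p_value = stats.ttest_rel(rmse_simo, rmse_siso)
# A p-value below 0.05 with a negative mean difference indicates that, in this
# illustrative sample, SIMO is statistically more accurate.
```

Pairing matters because the same test cases are evaluated by both models, so the test removes case-to-case variability that an unpaired test would leave in.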
As shown in Table 4, even though the SISO model statistically had the same performance as the MIMO and MISO models, we used all four models in the first stage to produce the forecasting results for the second stage because statistically equivalent forecasting accuracy does not imply identical WPF results. This comparison was performed to emphasize the outstanding performance of the SIMO model.

Experiment 2
We implemented the proposed two-stage algorithm to construct the forecasting model. Although the SIMO model discussed in Section 4.1.1 demonstrated better performance than the other models, the second stage of the algorithm attempts to further improve accuracy using the RR method. As mentioned previously, we preferred to use the latest information before the forecasting day to construct the ensemble model. This method effectively updates the ensemble model over time; however, it may lead to model extrapolation because of the small quantity of training data.
Thus, selecting the most appropriate modeling technique in the second stage is critical. Here, we used the SVR with a radial basis function (RBF) kernel, ANN, and GPR methods as benchmarks to better demonstrate the performance of the RR method. In addition, we used the sklearn toolbox in Python 3.7.3 to implement these four methods and tuned the parameters of each method using the grid search algorithm. The parameter settings are listed in Table 5. Table 6 shows the forecasting accuracies for each season using the different statistical modeling techniques, and Fig. 9 shows the interval plot of the forecasting accuracies presented in Table 6. In Table 6 and Fig. 9, we also show the forecasting accuracies obtained by the SIMO model to demonstrate the advantages of the ensemble model. Although the ensemble models with the four statistical modeling methods did not outperform the SIMO model in all cases, the RR model outperformed the SIMO model in each case. In addition, the 95% CI for the RR method was better than those of the other methods, which indicates that the RR method was statistically likely to generate better forecasting results. Compared to the RR method, the ML techniques did not exhibit better performance in the second-stage model. In certain cases, the SVR- and GPR-based ensemble models demonstrated considerably poorer performance than the other models (marked in red in Table 6). We further investigated this observation by plotting the forecasting results of the WF3 winter case, which are shown in Fig. 10.
Fig. 9. 95% confidence interval plot of forecasting accuracies using different modeling techniques.
Fig. 11. Input information for the eighth-day forecasting in the winter case of WF 3 (the first 1000 data points were used for modeling, and the rest were used for testing).
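The grid-search tuning step can be sketched with sklearn's GridSearchCV. The data and parameter grid below are illustrative placeholders; the paper's actual grids are given in its Table 5.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 100.0, (240, 4))          # hypothetical first-stage forecasts
y = X.mean(axis=1) + rng.normal(0.0, 2.0, 240)  # hypothetical real power

# Illustrative grid for the RR penalty; the same pattern applies to the
# SVR, ANN, and GPR benchmarks with their respective parameter grids.
search = GridSearchCV(Ridge(),
                      {"alpha": [0.01, 0.1, 1.0, 10.0]},
                      cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
best_alpha = search.best_params_["alpha"]
```

Scoring by negative RMSE keeps model selection consistent with the evaluation measure used in the experiments.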

Experiment 3
In this experiment, we exploited five existing methods as benchmarks to compare the performance of the proposed forecasting algorithm. Here, we used the persistence (P) and auto regressive integrated moving average (ARIMA) models, which have been used extensively in the literature as benchmarks for short-term wind power/speed forecasting. In addition, we utilized two two-stage forecasting models from Feng et al. [13] and Liu et al. [27]. The MM-DFS method by Feng et al. [13] used multiple single models in the first stage and deep feature selection techniques in the second stage, and the DWT-LSTM method by Liu et al. [27] used discrete wavelet transform and LSTM techniques. In addition, we took advantage of the smart deep learning based wind forecasting method (SDL) by Liu et al. [11], which utilized a combination of CNN and LSTM networks, as another benchmark. Table 7 shows the forecasting accuracy results. As seen, the proposed two-stage forecasting (TSF) algorithm demonstrated the lowest RMSE and MAE values in all cases. In addition, Fig. 12 shows the CI of the forecasting results obtained using different forecasting algorithms, and the TSF algorithm obtained the best forecasting accuracies.
In addition, we integrated all forecasting results and statistically analyzed the absolute difference between the forecasting results and the real wind power. Table 8 shows that the proposed algorithm obtained the lowest mean and variance values, which indicates that the proposed algorithm could generate more accurate and stable results than the other methods. An interesting finding is that all of the two-stage algorithms (the proposed algorithm, MM-DFS, and DWT-LSTM) obtained overall better forecasting results than the other three methods, likely because ensemble-based models are more comprehensive. (The lowest variance and mean of the absolute difference are presented in bold.)

Conclusions
WPF using an ensemble model is an extensively studied topic; however, several factors have not been considered within this framework. Therefore, in this paper, using the two-stage framework, we first introduced deep neural networks in the first stage to learn data features from different perspectives. Then, we explored the model extrapolation issue that occurred in the second stage, which has been ignored by the WPF community. Finally, we considered a validation set and incorporated the two-stage framework into a moving window based training data updating algorithm, which had two moving window processes to handle the overfitting issue.
Our experiments were divided into three parts. The first experimental results demonstrated that the single-input-multiple-output model obtained better forecasting accuracy than the other three models, which indicates that combining historical wind data and NWP data as a single input and adding wind speed as an output is likely to improve forecasting accuracy. In addition, the results of the second experiment demonstrated that using SVR and GPR in the second stage might result in model extrapolation errors, which highlights the importance of selecting a suitable ML algorithm to integrate the forecasting results from the first stage. The results of the third experiment demonstrated that the proposed two-stage algorithm generated better and more stable forecasting results than existing methods, which implies the advantages of the proposed algorithm.
The proposed forecasting algorithm could be beneficial to the intraday power market from different perspectives.
First, improved forecasts could reduce the balancing costs for power system operators. Second, improved forecasts could benefit wind energy traders in decision-making relative to purchasing costs in a microgrid system. Third, improved forecasts could reduce the penalty costs for wind plant owners because power systems typically require owners to submit short-term WPF in the intraday market. Overall, such an improved model is expected to increase the reliability of power systems and simultaneously offer benefits to energy traders and wind plant owners.
In the future, probabilistic WPF based on the proposed TSF algorithm should be investigated. Future work could also focus on reducing the extrapolation issue of other comprehensive algorithms in the second stage.