Data-augmented sequential deep learning for wind power forecasting

Accurate wind power forecasting plays a critical role in the operation of wind parks and the dispatch of wind energy into the power grid. With its excellent automatic pattern recognition and nonlinear mapping ability for big data, deep learning is increasingly employed in wind power forecasting. However, in-situ measured wind data are relatively expensive and often inaccessible, and the correlation between steps is omitted in most multistep wind power forecasts. This paper applies data augmentation to wind power forecasting for the first time: it systematically summarizes and proposes both physics-oriented and data-oriented time-series wind data augmentation approaches to considerably enlarge primary datasets, and it develops deep encoder-decoder long short-term memory networks that enable sequential input and sequential output for wind power forecasting. The proposed augmentation techniques and forecasting algorithm are deployed on five turbines with diverse topographies in an Arctic wind park, and the outcomes are evaluated against benchmark models and different augmentations. The main findings reveal that, on the one hand, the average improvement in RMSE of the proposed forecasting model over the benchmarks is 33.89%, 10.60%, 7.12%, and 4.27% before data augmentation, and increases to 40.63%, 17.67%, 11.74%, and 7.06%, respectively, after augmentation. On the other hand, the effect of data augmentation on prediction varies intricately across models, but comparing the proposed model with and without augmentation, all augmentation approaches boost its performance, from 7.87% to 13.36% in RMSE, 5.24% to 8.97% in MAE, and similarly over 12% in QR90. Finally, data-oriented augmentations are, in general, slightly better than physics-oriented ones.


Introduction
Wind is a renewable, sustainable, and environmentally friendly energy resource. As wind technology has developed in recent years, wind energy has received attention from a growing number of countries for its low-cost operation and maintenance, small turbine footprint, flexibility in development scale, and rapidly decreasing electricity generation costs. [1] Meanwhile, electricity generated from wind energy is volatile and intermittent and has low power density. These features affect the power production of generation companies and the balance of the grid, and may profoundly jeopardize grid security. [2] In a large-scale grid-connected system involving wind power, an unplanned load increase or an unscheduled wind power decrease will cause a supply-demand imbalance when thermal power or hydropower ceases generation or is insufficient. [3] Hence, the uncertainty in wind power production enlarges the required reserve capacity of the system. An accurate wind power forecast minimizes the spare capacity and enables optimal dispatch of power in systems with wind power generation. Furthermore, an effective prediction serves as a basis for wind parks to engage in generation bidding, helps determine a reasonable charging and discharging strategy for energy storage, and lowers the occurrence and duration of wind curtailments.
Wind power forecasting methodology is generally divided into physical, statistical, and hybrid approaches. [4] The first predicts wind power by extensive numerical computation of physical equations. It is based on fluid dynamics and uses Numerical Weather Prediction (NWP) data such as wind speed and pressure, and geoinformation like ground roughness and altitude. The method performs best in medium- or long-term forecasting and applies to the wind resource assessment of new wind parks that lack historical observations. The statistical approach aims to establish linear or nonlinear patterns within wind data sequences that can be utilized in forecasting. In particular, machine learning-based wind power forecasting methods developed in recent years are widely applied. The hybrid approach combines the former two categories and has clearly shown its advantages. [5]
In 2006, Hinton et al. successfully trained deep neural networks (i.e., artificial neural networks with several hidden layers) and achieved excellent performance on multiple datasets, [6] which signified the birth of deep learning. Since then, deep learning techniques based on neural networks of different designs have flourished and solved long-standing challenges, such as voice and image recognition and generation and the preliminary implementation of autonomous driving. [7] Recently, the application of deep learning to energy science has also become popular because of its powerful automatic pattern recognition and nonlinear mapping capabilities. [8] The two major drivers of the evolution of deep learning are growing computational capabilities and the influx of big data. It is generally agreed that larger datasets yield better deep learning models. [9] The effectiveness of deep supervised learning relies on the volume and quality of labeled training data as well as on the topology and parameter tuning of deep networks. [10]
Notably, an effective solution to establish large sets of training data is data augmentation, since the training set typically lacks a sufficient number of manually labeled samples. Especially in wind energy, it is generally challenging to acquire high-quality and long-duration meteorological and power production data.
Data augmentation is a technique to make supervised machine learning, especially with deep networks, more effective. It extends the amount of available training data by adding modified versions of existing data or new data generated from existing data. Technically, data augmentation imposes a form of perturbation or noise on the datasets; in signal processing and statistical modeling, both are viewed as unfavorable factors to be removed by filtering. [11,12] In deep learning, however, the effect of the technique is to regularize the model and help mitigate overfitting during training, thereby improving the generalizability of the learned models.
Overfitting is a phenomenon that occurs as a learner learns a function with extraordinarily large variance, such as perfectly fitting the training data. Generalizability defines the difference in performance when a model is assessed in relation to data in the training set previously seen compared to previously unseen data in the testing set. [13] Essentially, using multi-inputs to make multistep wind power forecasting can be regarded as a Sequence-to-Sequence (seq2seq) prediction that is framed as a mapping of multiple inputs to multiple time-series outputs. It was demonstrated that the seq2seq model "approaches or surpasses all currently published results" [14] in Natural Language Processing (NLP), like Google Translate, and recently it has also shown its promise in renewable energy forecasting. [15,16] The Encoder-Decoder (ED) Recurrent Neural Networks (RNN) has successfully handled seq2seq problems [17] and exhibits state-of-the-art performance in the area of text translation that is fundamentally a time-series problem.

Previous work review
In computer science research, several methodologies for data augmentation have been developed. [18] Shorten and Khoshgoftaar [13] systematically presented current image data augmentation methods, their promising advances, and the methodologies used to implement them to boost the performance of deep learning tasks on images. Cubuk et al. [19] investigated several commonly used image recognition datasets and designed an augmentation strategy that is learned from the datasets. The strategy consists of many sub-strategies that are automatically selected during model training, and it gains 0.4% to 0.6% in image classification accuracy on different datasets. However, data augmentation has been applied mainly in the field of image recognition, and transferring the technique to the sequence domain has received little attention. Both image and sequence deep learning tasks intrinsically focus on automatically exploiting data features while avoiding overfitting, so researchers should concentrate more on data augmentation applied to sequential deep learning. DeVries and Taylor [20] summarized and utilized domain-agnostic approaches such as interpolation and extrapolation for deep learning predictions on time-series datasets, and tentatively showed that the techniques are effective in some supervised learning problems. Park et al. [21] presented a speech recognition augmentation approach named SpecAugment, which consists of masking features, frequency channels, and time steps, and reached leading performance on two speech recognition task sets.
Deep learning techniques have received much attention from researchers in renewable energy forecasting. [8] With its distinctive automatic nonlinear recognition capabilities, deep learning has gradually emerged as an important approach to the challenge of forecasting sharply volatile wind power. [5,22] Yildiz et al. [23] extracted features from wind datasets with variational mode decomposition and converted these features into images. The images were then handled by an improved residual-based deep convolutional neural network to forecast wind power for a wind park in Turkey. The advantage of the proposed process was demonstrated by comparison with several widely used large networks. Kisvari et al. [24] constructed a framework consisting of data preprocessing, anomaly detection, feature engineering, and gated recurrent deep learning models for wind power prediction and demonstrated that the framework offered more effective predictions than traditional recurrent neural networks. Shahid et al. [25] stacked Long Short-Term Memory (LSTM) units into a large network, tuned the network with a genetic algorithm to forecast wind power, and validated the statistical advantage of the network over a single unit with the Wilcoxon signed-rank test. Memarzadeh et al. [26] applied a bionic algorithm, wavelet transform, feature selection, and LSTM networks to forecast the wind power of two wind parks in Spain and Iran, and showed the effectiveness of the proposed method by comparison with benchmark neural networks.
Numerous wind power models based on hybrids of traditional data methodologies and deep learning have been developed and have advanced forecasting for many sites. Nevertheless, further sophistication of forecasting models may render the results site-specific, i.e., wind power forecasts become restricted to a certain category of terrain and weather features and are difficult to generalize. Such models are also not easily employed, because their constituent techniques, such as signal processing and feature engineering, require prolonged and specialized training to master. Lipu et al. [27] also summarized the most recent progress of wind power forecasting using artificial intelligence and pointed out the issues and challenges in the field. The challenges include the wide variety of data preprocessing techniques for diverse wind data, model structure, and optimization. In particular, Reichstein et al. [28] recommended that, for Earth system science problems, more attention be given to approaches that couple data with physical phenomena and to deep learning methods themselves, rather than to building more complex models based on traditional methods.
In the present study, in contrast, we return to the physical process of wind power generation, the statistical characteristics of wind data, and the nature of deep learning to approach the forecasting problem.
After synthesizing numerous data augmentation methodologies and drawing on multiple state-of-the-art advances in sequential data prediction, the robust and efficacious encoder-decoder deep neural networks with stacking LSTM units are proposed for wind turbine power forecasting in the Arctic.

Contributions
Building on the aforementioned literature review, attention is paid to a wind park in complex terrain inside the Arctic. The principal contributions of the present study are as follows:
1. This paper systematically applies data augmentation to wind power forecasting for the first time. Specifically, eight time-series data augmentation approaches are proposed according to the physical characteristics of wind energy and the statistical properties of data in wind engineering. The approaches are implemented in four benchmark models and the proposed advanced deep learning model. The methodology is particularly suitable for new wind parks that have a short period of operation and therefore a limited amount of accumulated data. It makes it possible to fully and automatically exploit the information and value of these limited data.
2. We develop a seq2seq deep learning predictive end-to-end model whose inputs are historical wind speed and power data and wind speed from NWP, and whose outputs are simultaneous, interrelated multistep future wind power values. The model is based on an encoder-decoder constructed with LSTM and shows its superiority in forecasting power.
3. It is demonstrated that the impact of the various augmentation approaches differs for each forecasting algorithm. Augmentations somewhat increase the errors of linear models such as persistence. Nonetheless, augmentations improve the performance of neural network-based algorithms, most notably the proposed deep learning model, and data-oriented augmentations generally contribute more than physics-oriented ones.
4. The data augmentations combined with the proposed and benchmark forecasting models are utilized to predict the power generated by five turbines in various landscapes. The results are analyzed by rigorous statistical methods and indicate that the augmentations and the proposed forecasting model have value for wind engineering and potentially extensive applicability in other energy sectors.
The article is organized as follows. Section 1 introduces wind energy forecasting, the status quo of its use of deep learning, and the contributions of this work. Section 2 illustrates the principle of wind power generation and the utilized data and scheme. Section 3 delves into the proposed data augmentation techniques and a novel predictive deep neural network. Section 4 provides detailed experiment procedures and model assessment metrics. In Section 5, hierarchical experimental results and discussions, from comparisons of the models themselves to data augmentation approaches, are presented. Finally, the main findings, research outlooks, and derivative policy recommendations are given in Section 6.

Data preparation and forecast scheme
Wind power generation is a conversion from wind energy to electricity. Ideally, the output generation of a wind turbine is expressed as in (1):

$$P = \begin{cases} 0, & v < v_{\min} \ \text{or} \ v > v_{\max} \\ P_v(v), & v_{\min} \le v < v_n \\ P_r, & v_n \le v \le v_{\max} \end{cases} \qquad P_v(v) = \tfrac{1}{2} C_P \rho A v^3 \tag{1}$$

where $P$ is the output power of the wind turbine (W); $P_v(\cdot)$, typically proportional to the cube of the wind speed, is the power curve function on the speed interval; $C_P$ is the wind energy utilization efficiency; $\rho$ is the air density (kg/m$^3$); $A$ is the effective area swept by the turbine blades (m$^2$); $v$ denotes the wind speed (m/s); $v_{\min}$, $v_{\max}$, and $v_n$ are the cut-in, cut-off, and rated wind speeds, respectively; and $P_r$ is the turbine rated power. From (1), the output of a wind turbine is mainly influenced by the third power of wind speed, the air density, and the swept area.
The study centers on the electricity production of 3.0 MW Vestas V90 wind turbines at the Fakken wind park in the Arctic region, which has an installed capacity of 54 MW from 18 turbines and an average annual production of 139 GWh. Wind is strongly influenced by terrain; wind anomalies occur when wind passes over terrain barriers, and the influence depends on the height and width of the barriers. The terrain of the Fakken wind park consists of low, flat hills and narrow valleys and faces a fjord.
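As a rough illustration of the idealized piecewise relationship in (1), the sketch below evaluates the power of a single turbine at a given wind speed. All default parameter values (air density, efficiency, swept area, cut-in, rated, and cut-off speeds) are illustrative assumptions rather than the operating data of the studied turbines, and the function name `ideal_power` is hypothetical.

```python
def ideal_power(v, rho=1.225, c_p=0.4, area=6362.0,
                v_cut_in=4.0, v_rated=15.0, v_cut_out=25.0, p_rated=3.0e6):
    """Idealized turbine output in W for wind speed v in m/s.

    Default values are placeholders chosen for illustration only.
    """
    if v < v_cut_in or v > v_cut_out:
        return 0.0                      # turbine is idle outside the operating range
    if v >= v_rated:
        return p_rated                  # output is held at rated power
    cubic = 0.5 * c_p * rho * area * v ** 3
    return min(cubic, p_rated)          # cubic region, capped at rated power


if __name__ == "__main__":
    for speed in (2.0, 6.0, 10.0, 20.0, 30.0):
        print(speed, round(ideal_power(speed) / 1e6, 3), "MW")
```

The cap with `min` is a design choice of this sketch: with arbitrary placeholder parameters the cubic term need not meet the rated power exactly at the rated speed, so the cap keeps the curve bounded.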
The data in this study span from 0:00 on 1 January 2017 to 23:50 on 31 December 2017. Raw wind speed and power data of each turbine, with 10-min temporal resolution and recorded by the Supervisory Control And Data Acquisition (SCADA) system, are supplied by a local wind energy operator. The NWP wind speed data, calculated by the Meteorological cooperation on operational Ensemble Prediction System (MEPS) NWP model, have a 2.5 km horizontal resolution, which is taken as the mesoscale. The model, operated by the Norwegian Meteorological Institute, updates at 00, 06, 12, and 18 UTC, and its forecasts for the next 66 h are available around 1 h 15 min later. The wind speed data sequences from NWP comprise the nearest accessible weather prediction data.
To verify the generality and portability of the proposed methodology, five wind turbines situated in different topographic conditions in the wind park are selected as study subjects. Wind measurements are taken at the turbine nacelle, which is 80 m above the ground. Their topographic features and the statistics of annual in-situ measured wind speed and power are shown in Table 1.
Statistically, wind power forecasting can be regarded as a multivariable regression problem, in which wind power time series is autoregressed, and wind speed serves as the supplementing information to the autoregression. Updating the wind speed from NWP of the predicted time, the current information, is also the key feature in the prediction since according to an extensively cited reference by Giebel and Kariniotakis [29], forecasting wind power beyond three to six hours typically requires consideration of information on NWP wind speed at the moment of prediction. In this study, we chose measured data of the previous six hours to make multistep forecasts for the wind power from the next six to twelve hours with the assistance of wind speed from NWP.
The fundamental multistep forecasting model $f(\cdot)$ with timestep $i + n$ is described as:

$$P_{i+n} = f\left(P_{i-j},\ v_{i-j},\ u_{i+n}\right) + \varepsilon_n \tag{2}$$

where $i$ represents the base current time, $i = 1, 2, \ldots, 7$, and for each $i$, $j = 0, 1, \ldots, 6$. $P_{i+n}$ is the wind power predicted $n$ timesteps ahead, $n \in \{6, 7, 8, 9, 10, 11, 12\}$; $v$ is the wind speed observed at the turbine; $u$ represents the wind speed calculated from the mesoscale NWP wind model for the site; and $\varepsilon_n$ is the error of the forecasting model.
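To make the input-output framing concrete, the following sketch slices hourly series into samples of the kind described by Eq. (2): a history window of measured power and speed, the NWP speeds at the target hours, and the power targets six to twelve hours ahead. The function name `make_samples`, the window length of seven steps (matching j = 0, …, 6), and the tuple layout are assumptions for illustration.

```python
def make_samples(power, speed, nwp_speed, n_in=7, leads=tuple(range(6, 13))):
    """Build (history, future_nwp, target) samples from aligned hourly series.

    history    : [(P[i-j], v[i-j])] for j = n_in-1, ..., 0 (oldest first)
    future_nwp : [u[i+n]] for each lead n
    target     : [P[i+n]] for each lead n
    """
    samples = []
    max_lead = max(leads)
    for i in range(n_in - 1, len(power) - max_lead):
        history = [(power[i - j], speed[i - j]) for j in range(n_in - 1, -1, -1)]
        future_nwp = [nwp_speed[i + n] for n in leads]
        target = [power[i + n] for n in leads]
        samples.append((history, future_nwp, target))
    return samples


if __name__ == "__main__":
    hours = 30
    demo_power = [float(t) for t in range(hours)]
    demo_speed = [0.1 * t for t in range(hours)]
    demo_nwp = [0.1 * t + 0.5 for t in range(hours)]
    demo = make_samples(demo_power, demo_speed, demo_nwp)
    print(len(demo), demo[0][2])
```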
Since the ranges of wind power and wind speed are not the same, it is beneficial to rescale the raw data into a new set with a similar scale. Data standardization rescales variables to have a mean of zero and a standard deviation (STD) of one. The technique can accelerate convergence and improve the accuracy of neural network algorithms. [30]

Methodology

Wind data augmentation
In practice, testing errors need to be continuously reduced along with training errors to construct meaningful deep learning models. Data augmentation is a robust approach to accomplish this aim. It tackles overfitting at the root of the problem, the training data themselves, on the assumption that further information can be extracted from the source dataset.
Based on know-how in wind energy technology and state-of-the-art data science, we divide the techniques for augmenting wind data for forecasting with robust and efficient deep learning into two categories: physics-oriented and data-oriented.

Physics-oriented approaches
Inspired by the physics of wind power engineering, we propose three strategies to augment training set data for forecasting models. The first is the explicit perturbation of the wind power curve according to Eq. (1). The second is the implicit perturbation based on the difference between the numerical weather predicted wind speed of the wind park area and the actual measured wind speed of turbines. The third considers the operational data of the other wind turbines in the vicinity of the studied wind turbines. These three physics-oriented approaches are abbreviated as PA1, PA2, and PA3, respectively.
PA1: Considering the wind speed as the independent variable and differentiating Eq. (1), the following Eq. (3) is obtained:

$$\frac{\mathrm{d}P}{\mathrm{d}v} = \tfrac{3}{2} C_P \rho A v^2, \qquad v_{\min} \le v < v_n \tag{3}$$

From Eq. (1), it is observed that when $v$ lies between the cut-in and rated wind speeds, the derivative of the power curve, i.e., the ratio of small variations in wind turbine power and wind speed, is proportional to the square of the wind speed at that point. Therefore, according to Eq. (3), it is possible to artificially add a slight random perturbation to a wind speed point in the interval and calculate the corresponding power variation.
PA2: According to Eq. (2), the input to the power forecasting model contains wind speeds from measurements and from the NWP model, but they correspond to different time stamps when entering the model. NWP datasets also contain wind speeds that correspond to the same time stamps as the measured wind speeds, and based on our previous study there is no significant difference between the wind speed probability distributions of the two wind speed sources in the wind park. [31] We therefore resort to a random replacement strategy that, with a fixed probability, replaces the wind speeds in the measured datasets with the corresponding NWP wind speeds.
PA3: The turbines neighboring the target turbine operate under similar wind conditions. Therefore, replacing the measured wind speed of the target turbine with that of a neighboring turbine, with a specific probability, could be a strategy to augment the target wind speed dataset.
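The three physics-oriented strategies can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `k` stands for the lumped constant 0.5·C_P·ρ·A of the cubic power-curve region, the relative perturbation amplitude `rel_amp` and the replacement probability `prob` are hypothetical parameters, and the function names are invented for this example.

```python
import random


def pa1_perturb(v, p, k, v_cut_in, v_rated, rel_amp=0.02, rng=random):
    """PA1: nudge a (speed, power) pair along the cubic part of the power curve.

    With P = k * v**3, a small speed change dv changes power by about
    3 * k * v**2 * dv (first-order approximation). Points outside the
    cut-in/rated interval are returned unchanged.
    """
    if not (v_cut_in <= v < v_rated):
        return v, p
    dv = rng.uniform(-rel_amp, rel_amp) * v
    return v + dv, p + 3.0 * k * v ** 2 * dv


def random_replace(measured, substitute, prob, rng=random):
    """PA2/PA3: replace each measured speed by the time-aligned substitute
    (NWP speed for PA2, a neighboring turbine's speed for PA3) with
    probability prob."""
    return [s if rng.random() < prob else m for m, s in zip(measured, substitute)]


if __name__ == "__main__":
    rng = random.Random(42)
    print(pa1_perturb(8.0, 512.0, 1.0, 4.0, 15.0, rng=rng))
    print(random_replace([5.0, 6.0, 7.0, 8.0], [5.5, 6.5, 7.5, 8.5], 0.3, rng=rng))
```

Both PA2 and PA3 reduce to the same replacement routine here; only the source of the substitute series differs.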

Data-oriented approaches
The proposed taxonomy of data-oriented methods for wind power forecasting is inspired by feature space expansion, signal processing, and machine learning techniques. It consists of five approaches. DA1: simple interpolation and extrapolation methods used to obtain data on larger time scales. DA2: injection of noise into the original dataset. DA3: sequential augmentation approaches, termed geometric transformations, that draw on image processing, namely symmetry or flipping, translation, and random erasing. DA4: decomposition methods for time-series data. DA5: scenario generation methods for a single turbine, including statistical and machine learning generation.
DA1: Averaging is usually required to calculate the data in hourly units, as the original measured dataset is in ten-minute increments. New hourly data can be acquired by applying an interpolation or extrapolation modification to this averaging process. The new averaging is defined as:

$$\bar{x}_t = \sum_{j=1}^{6} \omega_j x_j$$

where $\bar{x}_t$ is the hourly data and $x_j$ denotes the raw 10-min data. $\omega_j$ is the stochastic weight that fulfills:

$$\sum_{j=1}^{6} \omega_j = 1$$

DA2: Another simple, probably the simplest, method of data augmentation is the addition of white noise, following the standard normal distribution, to the data. A wind power forecasting study considered noise in data as a detrimental factor for prediction and removed it by signal processing. [32] Nonetheless, in machine learning research, applying noise to the inputs of a neural network increases the generalizability of the network. [18] The noise injection is determined with a scaling parameter $\delta$:

$$\tilde{x}_t = x_t + \delta\, \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, 1)$$

where $\tilde{x}_t$ is the enhanced data and $x_t$ denotes the original hourly data.
DA3: Geometric transformations, such as flipping, cropping, and color transformations, are among the earliest data augmentation methods and are highly effective in deep learning for image recognition. [13] Based on the characteristics of the measured wind speed time series and referring to image geometric augmentations, we stochastically apply, each with a probability of 10%, symmetry about the average point, substitution by the prior or posterior value, and random erasing of some data.
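A minimal sketch of DA1 to DA3 follows. Several details are assumptions made only for illustration: the DA1 weights are drawn uniformly from [0.5, 1.5] and normalized to sum to one, erased points in DA3 are replaced by the series mean, and substitution always uses the prior value except at the first point. The function names are hypothetical.

```python
import random


def da1_weighted_hourly(ten_min_values, rng=random):
    """DA1: replace the plain hourly mean by a random convex combination."""
    weights = [rng.uniform(0.5, 1.5) for _ in ten_min_values]
    total = sum(weights)
    return sum(w / total * x for w, x in zip(weights, ten_min_values))


def da2_add_noise(series, delta, rng=random):
    """DA2: add standard-normal white noise scaled by delta."""
    return [x + delta * rng.gauss(0.0, 1.0) for x in series]


def da3_geometric(series, prob=0.10, rng=random):
    """DA3: per point, apply reflection, neighbor substitution, or erasing,
    each with probability prob; otherwise keep the point unchanged."""
    mean = sum(series) / len(series)
    out = list(series)
    for t, x in enumerate(series):
        r = rng.random()
        if r < prob:
            out[t] = 2.0 * mean - x            # symmetry about the average
        elif r < 2 * prob:
            neighbor = t - 1 if t > 0 else min(t + 1, len(series) - 1)
            out[t] = series[neighbor]          # substitute a neighboring value
        elif r < 3 * prob:
            out[t] = mean                      # erase (assumed: fill with mean)
    return out


if __name__ == "__main__":
    rng = random.Random(7)
    print(da1_weighted_hourly([5.0, 5.2, 5.1, 4.9, 5.3, 5.0], rng=rng))
    print(da2_add_noise([5.0, 5.1, 5.2], delta=0.05, rng=rng))
    print(da3_geometric([5.0, 5.5, 6.0, 6.5, 7.0], rng=rng))
```

Note that reflection about the mean can produce physically implausible values for raw wind speeds near zero; on standardized data this is less of a concern.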
DA4: Wind power forecasting is mathematically a special time-series problem. Ordinarily, the time series $x_t$ can be decomposed into base $\alpha_t$, trend $\tau_t$, season $s_t$, and residual $\gamma_t$ parts as in Eq. (6):

$$x_t = \alpha_t + \tau_t + s_t + \gamma_t \tag{6}$$

The extensively implemented approach first uses the time-domain plot of the time series or its Fourier analysis to obtain the period corresponding to seasonality, and then decomposes the time series into the above four components with the loess smoothing technique, [33] a locally weighted regression. The weights of these four components are subsequently and stochastically adjusted by Eq. (7) to form an augmented series.
Note: STD is standard deviation, Skew is skewness and Kur is relative kurtosis (actual kurtosis minus 3).
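The decompose-then-reweight idea of DA4 can be sketched as below. To stay dependency-free, this example swaps in a centered moving average and per-phase means for the loess-based decomposition mentioned in the text, and it assumes an additive model with weights drawn uniformly around one; the `spread` parameter and both function names are hypothetical.

```python
import random


def decompose(series, period):
    """Additive split into base (overall mean), trend, seasonal, and residual.

    A moving average stands in for loess smoothing; requires
    len(series) >= period.
    """
    n = len(series)
    base = sum(series) / n
    centered = [x - base for x in series]
    half = period // 2
    trend = []
    for t in range(n):
        window = centered[max(0, t - half): min(n, t + half + 1)]
        trend.append(sum(window) / len(window))
    detrended = [c - tr for c, tr in zip(centered, trend)]
    phase_mean = []
    for p in range(period):
        phase = detrended[p::period]
        phase_mean.append(sum(phase) / len(phase))
    seasonal = [phase_mean[t % period] for t in range(n)]
    resid = [d - s for d, s in zip(detrended, seasonal)]
    return base, trend, seasonal, resid


def da4_reweight(series, period, spread=0.1, rng=random):
    """DA4: recombine the components with random weights close to one."""
    base, trend, seasonal, resid = decompose(series, period)
    w = [1.0 + rng.uniform(-spread, spread) for _ in range(4)]
    return [w[0] * base + w[1] * tr + w[2] * s + w[3] * r
            for tr, s, r in zip(trend, seasonal, resid)]


if __name__ == "__main__":
    demo = [(t % 24) * 0.5 + 0.1 * t for t in range(72)]
    print(da4_reweight(demo, period=24, rng=random.Random(0))[:5])
```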

DA5: The data augmentation methodologies described above all involve randomness, data selections, and/or weight adjustments, so they are relatively independent of the data and require considerable manual fine-tuning. Wind power scenario generation is an effective tool to resolve uncertainties in stochastic planning of the energy system with the integration of wind power. [34] Classical and advanced statistical methods and machine learning models are broadly employed [35] to predict wind power scenarios. Intrinsically, these models profile conditional distributions of time series by assuming that the current value depends on previous points: a new time series may be generated from the learned conditional distributions provided that original series values are perturbed in some way.

Encoder-decoder LSTM deep networks
RNNs have achieved tremendous success and wide application in numerous sequence tasks. [18] An RNN is designed to process learning tasks with sequential data. 'Recurrent' means that the current output is related to the previous output. The nodes in the hidden layers are connected to each other, so that the inputs of a hidden layer include not only the outputs of the input layer but also the outputs of the hidden layer at the previous time step.
Among the RNN network structures, the most extensively used and highly successful model is the LSTM network, which has a unique kind of memory unit in its hidden layers and is generally more expressive of long- and short-term dependencies than other RNNs. [36] Typically, the LSTM unit consists of three gates, i.e., input gate, forget gate, and output gate. There are three primary internal phases of the unit. The first is the forget phase, which retains the important information coming in from the previous node and forgets the unimportant details. The next is the selective memory phase, which selectively remembers the inputs of this phase. Finally, the output phase determines which information should be treated as the output of the current state. Mathematically, the LSTM unit can be expressed as [37]:

$$\begin{aligned}
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) \\
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$

where $x_t$ is the input and $h_{t-1}$ is the hidden state of the previous timestep. $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates; $W_{\cdot}$ denotes the corresponding weight parameter and $b_{\cdot}$ the corresponding bias parameter. $\tilde{c}_t$ is the candidate memory cell, $c_t$ is the memory cell, and $c_{t-1}$ is its state at the previous time step. $h_t$ is the hidden state. $\sigma(\cdot)$ is the sigmoid function, $\tanh(\cdot)$ is the hyperbolic tangent function, and $\odot$ represents pointwise multiplication.
The encoder-decoder LSTM is a type of ED RNN designed to deal with seq2seq problems, and its architecture is innovative in terms of sequence embedding, i.e., reading in and writing out fixed-size sequences. In this study, the encoder-decoder LSTM includes an input layer, an LSTM-based encoder and decoder, and an output layer. The LSTM unit extracts and utilizes the important information in the sequence through its gate controls. The encoder reads input sequences and encodes them, by weighting each time step, into a fixed-length context vector. The decoder decodes this fixed-length vector and outputs the predicted sequences. The fixed-length context vector is associated with a mechanism called Attention, which summarizes and highlights the information learned by the encoder and passes it as input to the decoder. The encoder and decoder networks are mutually independent, which means that their LSTM units do not share parameters during network training.

Proposed deep EDLSTM for wind power forecasting
According to Eq. (2), wind power prediction involves autoregression, multiple sources of wind speed, and nonlinear functional relationships, all of which motivate the application of EDLSTM networks. In addition, multistep wind power forecasting is appropriately handled as a seq2seq problem since the historical input data are linked and interactive. Therefore, a deep EDLSTM with stacked multiple layers, abbreviated as EDLSTM, is proposed and utilized to extract the implicit features from layer to layer. The detailed deep EDLSTM employed in this article is illustrated in Fig. 1.
First, the encoder consists of a stack of three LSTM layers, which sequentially extract complex time-dependent features of the input measured and meteorological data layer by layer while transferring the hidden states h, and then generate a fixed-length context vector containing the extracted feature information. The structure and transmission of information in the decoder are basically identical to those in the encoder. The context vector then serves as the initial input to the decoder: regardless of how the encoder updates it, the context vector is sent to the first layer of the decoder as its input, and that layer's output is used as the input of the second layer. In turn, the third-layer output is transformed through the output layer and cyclically fed back to the first layer as its next input. Eventually, the decoder generates a time series of the predicted wind power.
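The data flow described above can be traced with an untrained forward pass in NumPy. This is only a sketch of the information path, not the trained model of Fig. 1: weights are random, the decoder states start at zero, and the vector `w_back`, which maps each scalar prediction back to the decoder's input size, is a hypothetical device introduced here because the text does not specify how the fed-back output is shaped. All function and parameter names are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_layer(n_in, n_hid, rng):
    """One LSTM layer: stacked weights for the four gate blocks plus a bias."""
    return {"W": rng.normal(0.0, 0.1, (4 * n_hid, n_in + n_hid)),
            "b": np.zeros(4 * n_hid)}


def lstm_step(layer, x, h, c):
    """Single LSTM time step following the standard gate equations."""
    z = layer["W"] @ np.concatenate([h, x]) + layer["b"]
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new


def run_stack(layers, x, states):
    """Pass x upward through stacked layers; return top output and new states."""
    new_states = []
    for layer, (h, c) in zip(layers, states):
        h, c = lstm_step(layer, x, h, c)
        new_states.append((h, c))
        x = h                              # output of one layer feeds the next
    return x, new_states


def init_params(n_feat, n_hid, n_layers=3, seed=0):
    rng = np.random.default_rng(seed)
    enc = [init_layer(n_feat if k == 0 else n_hid, n_hid, rng) for k in range(n_layers)]
    dec = [init_layer(n_hid, n_hid, rng) for _ in range(n_layers)]
    return {"encoder": enc, "decoder": dec,
            "w_out": rng.normal(0.0, 0.1, n_hid), "b_out": 0.0,
            "w_back": rng.normal(0.0, 0.1, n_hid)}   # hypothetical feedback map


def zero_states(layers, n_hid):
    return [(np.zeros(n_hid), np.zeros(n_hid)) for _ in layers]


def edlstm_forward(inputs, n_out, params):
    """Encode a (T, n_feat) sequence, then decode n_out scalar predictions."""
    n_hid = params["w_out"].size
    states = zero_states(params["encoder"], n_hid)
    for x in inputs:                       # encoder reads the sequence step by step
        context, states = run_stack(params["encoder"], x, states)
    dec_in = context                       # context vector is the first decoder input
    dec_states = zero_states(params["decoder"], n_hid)
    preds = []
    for _ in range(n_out):
        top, dec_states = run_stack(params["decoder"], dec_in, dec_states)
        y = float(params["w_out"] @ top + params["b_out"])   # output layer
        preds.append(y)
        dec_in = y * params["w_back"]      # feed the prediction back as next input
    return np.array(preds)


if __name__ == "__main__":
    params = init_params(n_feat=3, n_hid=8)
    history = np.ones((7, 3))              # 7 past steps, 3 features per step
    print(edlstm_forward(history, n_out=7, params=params))
```

In practice such a network would be built and trained with a deep learning framework; the hand-written pass is meant only to show how the context vector and the cyclic feedback connect the two stacks.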

Experimental scheme
The scheme for forecasting individual turbine wind power with EDLSTM and data augmentation is illustrated in Fig. 2. First, the measured wind speed and power at ten-minute resolution are averaged into hourly data, except under the DA1 augmentation, which modifies this averaging. All hourly data are segmented into training and testing sets, accounting for 65% and 35%, respectively. Second, the measured wind speed and/or wind power data in the training set are separately augmented with the approaches proposed in Section 3.1 to enlarge the data amount to five times the original training set size, i.e., new data four times the size of the original training set are generated by augmentation. Third, the unexpanded and expanded training sets are individually fed into the benchmark models and the proposed deep EDLSTM network to conduct training and obtain multiple learned models. The benchmark models are Persistence (PR), simple three-layer backpropagation Neural Networks (NN), basic LSTM RNN (LSTM), and Adaboost ensemble learning constructed from bionic-optimized neural networks (BA); the last is regarded as a popular and advanced hybrid forecasting approach, namely an ensemble learning prediction model, that has been extensively studied and proven to perform well. [39,40,41,42] The benchmark models have been introduced in Refs. [41,43,44], and their parameters are briefly summarized in Table 2. Finally, the testing set data are imported into the trained models to yield the multistep predicted wind power and to assess and compare the performance of the forecasting models.

Data augmentation program
[Table 2, recovered fragment] … [46], which is similar to the seq2seq structure in forward propagation to achieve the integration of input and output for effectively mining the features, while in backpropagation some gradients are fed directly to the output, avoiding gradient vanishing. EDLSTM: as described in Section 3.3 and Fig. 1 (the LSTM unit uses TensorFlow optimized default settings for regression problems).

Our data augmentation strategy fine-tunes the data without altering the temporal order of the original data and ensures that the augmented training data remain statistically consistent with the original ones. This study augments the training samples and scales up their number to five times the original sample size. The data augmentation techniques explained above, apart from DA5, all involve stochastic perturbation of the original data. Our method is to gradually enlarge the perturbation amplitude and accordingly generate new data four times. For the DA5 method, four new datasets are produced by individually operating autoregressive models based on four machine learning models. Details of the various data augmentation approaches are shown in Table 3.
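The fivefold enlargement with gradually increasing perturbation can be sketched as follows. The amplitude schedule, the helper names, and the use of noise injection as the example augmentation are assumptions for illustration; the actual settings are those listed in Table 3. Each augmented copy is kept as a separate series so that the temporal order within a copy is untouched.

```python
import random


def noise_augment(series, amplitude, rng):
    """Example augmentation: white noise scaled by the given amplitude."""
    return [x + amplitude * rng.gauss(0.0, 1.0) for x in series]


def build_augmented_set(train, augment, amplitudes=(0.01, 0.02, 0.03, 0.04), seed=0):
    """Return the original series plus one augmented copy per amplitude.

    With four amplitudes the result holds five series, i.e. five times the
    original training data.
    """
    rng = random.Random(seed)
    copies = [list(train)]
    for amplitude in amplitudes:
        copies.append(augment(train, amplitude, rng))
    return copies


if __name__ == "__main__":
    train = [0.2, 0.4, 0.1, -0.3, 0.0, 0.5]
    for copy in build_augmented_set(train, noise_augment):
        print([round(x, 3) for x in copy])
```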

Performance evaluation
Data-driven wind power forecasting is inherently a regression problem solved with advanced neural networks, in which the Mean Square Error (MSE) serves as the loss function. The Root Mean Square Error (RMSE) is therefore naturally selected as the metric to measure the performance of the models. The metric is negatively oriented, which means a smaller value corresponds to better performance.

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(P_i - \hat{P}_i\right)^2}$$

where $P_i$ and $\hat{P}_i$ are the normalized measured and corresponding predicted wind power, and $m$ is the number of samples in the testing set. Nevertheless, the RMSE is disproportionately affected by larger errors and is sometimes close between different forecasting models. Therefore, in these cases, the Mean Absolute Error (MAE) and Qualification Rate (QR) [47] indices are introduced as below to comprehensively assess the performance of the models. MAE weighs all forecasting errors uniformly, while QR emphasizes the smaller errors.

$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|P_i - \hat{P}_i\right|$$

$$\mathrm{QR} = \frac{1}{m}\sum_{i=1}^{m} B_i \times 100\%, \qquad B_i = \begin{cases} 1, & 1 - \dfrac{\left|P_i - \hat{P}_i\right|}{Cap} \geq Q \\ 0, & \text{otherwise} \end{cases}$$
where Cap is the designed capacity of the turbine and Q is the threshold percentage for qualified predictions, chosen as 90% in this study. Two statistical tests are employed to check whether statistically significant differences exist in the performance of the forecasting models, and the significance level of both is set to 0.05. The first is the paired T-test for comparisons between two groups. Its null hypothesis H 0 is that the averages of the samples are equivalent, and the alternative H a is that the averages are not equivalent. Its test statistic T is computed from Y, the sample average, and l, the number of samples. The second is the Friedman test for multiple comparisons, which examines results across multiple trials and checks column effects after statistically eliminating potential row effects. [48] H 0 : The column data do not have a significant difference. H a : They have a significant difference. In the statistic F, k is the number of columns and r i is the average value of row i; F follows the χ 2 (k− 1) distribution under H 0 .
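The three evaluation indices can be sketched in a few lines. This is an illustration under assumptions: the function names are hypothetical, and the QR form used here (the share of predictions whose accuracy 1 − |error|/Cap reaches the threshold Q) is a common definition assumed for this sketch rather than quoted from Ref. [47]. With normalized power, `cap` is taken as 1.0.

```python
import math


def rmse(measured, predicted):
    """Root mean square error over paired samples."""
    m = len(measured)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(measured, predicted)) / m)


def mae(measured, predicted):
    """Mean absolute error over paired samples."""
    m = len(measured)
    return sum(abs(p - q) for p, q in zip(measured, predicted)) / m


def qualification_rate(measured, predicted, cap=1.0, q=0.90):
    """Fraction of predictions with 1 - |error| / cap >= q (assumed QR form)."""
    m = len(measured)
    qualified = sum(
        1 for p, r in zip(measured, predicted) if 1.0 - abs(p - r) / cap >= q
    )
    return qualified / m


# Hypothetical normalized power values.
measured = [0.20, 0.50, 0.80, 0.40]
predicted = [0.25, 0.45, 0.60, 0.40]
print(rmse(measured, predicted), mae(measured, predicted),
      qualification_rate(measured, predicted))
```

In this toy case the single large error (0.20) dominates the RMSE but counts only once in the MAE and simply fails the QR threshold, which mirrors the different emphases of the three indices.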

Results and discussion
This section presents the experimental results at three levels. Firstly, the superiority of the proposed forecasting model is verified by analyzing the performance of different models on the original dataset. Secondly, the overall effects of data augmentations on different forecasting algorithms are illustrated by comparing their performance before and after data augmentation. Finally, the impacts of the various augmentation approaches on the forecasting effectiveness of the proposed model are statistically explored.

Benchmarks and proposed deep EDLSTM model forecasting outcomes
The standardized measured and NWP wind data of the five chosen wind turbines are loaded into the four benchmarks and the proposed deep EDLSTM model to make six- to twelve-hour-ahead wind power forecasts. The RMSE is displayed in Fig. 3. In general, the RMSE of all forecasting models grows with the number of prediction steps, and that of PR grows faster than the others. The proposed deep EDLSTM performs best among all models for multistep power prediction for all wind turbines in almost all cases. The RMSE of NN, LSTM, BA, and EDLSTM, all built on neural networks, is noticeably smaller than that of PR, suggesting that neural networks can reflect the nonlinear characteristics of wind power. Moreover, these characteristics are better retained by the forecasting models as the networks become deeper and more tailored. On overall average, the benchmark PR, NN, LSTM, and BA models have RMSE that is 51.46%, 11.89%, 7.67%, and 4.46% higher than EDLSTM. This demonstrates that the proposed model can efficiently and accurately predict the power generated by the five wind turbines under attention. Besides, EDLSTM's RMSE remains relatively stable as the step increases, indicating that seq2seq with multiple inputs and multiple outputs reduces the cumulative error in multistep forecasting. Reasonably, the forecasting algorithms yield relatively low RMSE for the wind turbines situated on the plateau and lakeside, both of which are regarded as flat terrain. In contrast, the unique fjord topography on the Norwegian coast makes the electricity generation of wind turbines located on hilltops, in valleys, and at seasides challenging to predict, although EDLSTM handles these cases properly. Therefore, the proposed model allows for effective and robust power predictions of wind turbines under several different topographical conditions.

Table 3
A detailed description of each data augmentation process.

Physics-oriented

PA1: The Vestas V90 3 MW wind turbine corresponds to a cut-in and rated wind speed of 4 and 15 m/s, respectively, according to its power curve. The measured wind speeds v_i in the corresponding interval are selected and perturbed as v'_i = v_i + X, X ∼ U[−0.1n, 0.1n], n = 1, 2, 3, 4, where U represents the uniform distribution. The power variation corresponding to the wind speed variation is then calculated by Eq. (3), and new power data are generated accordingly.

PA2: The measured wind speeds are randomly substituted, with 50% probability and four times, by NWP wind speed data with the same timestamps, and white noise following N(0, 0.1) is added to the wind power data.

PA3: The measured wind speeds of the two turbines closest to the target turbine are selected and randomly substituted into the target wind speed dataset, with a probability of 15% for each and 30% in total. The power data receive the same treatment as in PA2.

Data-oriented

DA1: As described in the DA1 introduction in Section 3.1.2.

DA4: As described in the DA4 introduction in Section 3.1.

DA5: Four learning algorithms are used to augment the measured wind data; the generated data are produced by fi(·), which represents a single-step-ahead forecasting model established by a learning algorithm. f1(·) is linear regression, f2(·) is support vector regression, f3(·) is classification and regression tree, and f4(·) is a simple three-layer neural network regression model with 15 hidden neurons. All four are well-established and widespread machine learning algorithms; owing to space constraints, a detailed description of them can be found in Ref. [43].

Note: The units of wind speed and power in the table are m/s and MW, respectively.
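The PA2 entry of Table 3 (random substitution of measured wind speeds by NWP values, plus white noise on power) could be sketched as follows. This is a hedged illustration: the function name `pa2_augment` is hypothetical, and the 0.1 in N(0, 0.1) is interpreted here as a standard deviation, which the table does not specify.

```python
import random


def pa2_augment(measured_speed, nwp_speed, measured_power,
                substitute_prob=0.5, noise_std=0.1, seed=0):
    """Generate one PA2-style augmented copy of speed and power series.

    Each measured wind speed is replaced by the NWP wind speed with the
    same timestamp with probability `substitute_prob`; Gaussian white
    noise with standard deviation `noise_std` (assumed interpretation of
    N(0, 0.1)) is added to every power sample.
    """
    rng = random.Random(seed)
    new_speed = [
        nwp if rng.random() < substitute_prob else obs
        for obs, nwp in zip(measured_speed, nwp_speed)
    ]
    new_power = [p + rng.gauss(0.0, noise_std) for p in measured_power]
    return new_speed, new_power


# Hypothetical time-aligned samples (m/s and MW).
speeds = [5.0, 6.0, 7.0, 8.0]
nwp = [5.5, 6.5, 6.8, 8.2]
power = [0.4, 0.7, 1.1, 1.6]

# Four copies with different seeds, matching the "four times" in Table 3.
copies = [pa2_augment(speeds, nwp, power, seed=s) for s in range(4)]
print(len(copies))
```

PA3 would follow the same pattern, with the substitution source switched to the two neighboring turbines and the probabilities set to 15% each.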

Holistic validity of data augmentations
To investigate the applicability of data augmentation in wind power prediction, the original measured data are enlarged following the eight augmentation approaches presented in Section 3.1, and forecasts are made with the four benchmarks and the proposed EDLSTM model. The RMSE of the six- to twelve-step forecasts based on the eight data-augmented sets is averaged separately for each forecasting algorithm. The results are compared with the equally averaged RMSE of the models without augmentation. Fig. 4 shows the comparison, and Table 4 reports the performance differences with the paired T-test.
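A paired comparison of RMSE before and after augmentation can be sketched as below. This is a textbook form of the paired t statistic, assumed here because the paper's own expression is not reproduced in this section; the function name and the RMSE values are hypothetical, not taken from Table 4.

```python
import math
from statistics import fmean, stdev


def paired_t_statistic(before, after):
    """Return the paired t statistic and its degrees of freedom.

    The statistic is the mean of the pairwise differences divided by
    its standard error (sample standard deviation over sqrt(l)).
    """
    diffs = [b - a for b, a in zip(before, after)]
    l = len(diffs)
    t = fmean(diffs) / (stdev(diffs) / math.sqrt(l))
    return t, l - 1


# Hypothetical per-turbine RMSE without and with augmentation.
rmse_before = [0.150, 0.160, 0.170, 0.180, 0.190]
rmse_after = [0.140, 0.148, 0.161, 0.168, 0.181]

t_value, dof = paired_t_statistic(rmse_before, rmse_after)
# For 4 degrees of freedom, the two-sided 5% critical value of t is 2.776.
print(t_value, dof, abs(t_value) > 2.776)
```

The pairing matters because each turbine serves as its own control, so terrain-driven differences between turbines do not inflate the variance of the comparison.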
As can be seen, the average effect of data augmentation is tightly linked to the forecasting algorithm. The RMSE of PR with data augmentation is the same as without it for all wind turbines in focus. The reason is that the PR method involves no learning process, so its RMSE remains unchanged whether the augmentations apply stochastic perturbations to the data or generate new data based on patterns of the original data. It is therefore meaningless to discuss augmentation further for the PR approach. For all network-based forecasting algorithms (NN, LSTM, BA, and EDLSTM), there is an apparent difference within one STD, with p-values smaller than 0.05, between the RMSE before and after augmentation. This can be interpreted as follows: these algorithms can learn the dominant or trending patterns in the input space, and data augmentations additionally provide valuable information in the training phases of these network-based models.
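Why augmentation cannot change the PR result becomes clear from a sketch of the persistence forecast, which takes no training set at all. The implementation below is an assumed, common form of persistence (repeat the last observed value for every step ahead); the function names and the ramp series are hypothetical.

```python
import math


def persistence_forecast(history, horizon):
    """Repeat the last observed value for every step ahead."""
    return [history[-1]] * horizon


def persistence_rmse(test_series, horizon):
    """RMSE per step ahead of the persistence forecast on a test series.

    No training data enter this computation, so enlarging the training
    set by augmentation cannot alter the result.
    """
    squared_errors = [0.0] * horizon
    count = 0
    for t in range(1, len(test_series) - horizon + 1):
        forecast = persistence_forecast(test_series[:t], horizon)
        for h in range(horizon):
            squared_errors[h] += (test_series[t + h] - forecast[h]) ** 2
        count += 1
    return [math.sqrt(s / count) for s in squared_errors]


# Hypothetical normalized power ramp rising by 0.1 per hour.
ramp = [0.1 * i for i in range(10)]
pr_rmse = persistence_rmse(ramp, horizon=3)
print(pr_rmse)
```

On a steady ramp the persistence error grows linearly with the step, a simple picture of why the RMSE of PR rises faster with the horizon than that of the learned models.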
Most notably, a significant improvement in the performance of the EDLSTM forecasting algorithm, with a statistical average difference over 0.0102, is evident with augmented input data. On the one hand, this means that the limited original data restrict the potential of the proposed deep learning model or possibly cause overfitting. On the other hand, it demonstrates that the augmented data train the complex deep networks more adequately, yielding better predictions through insight into more hidden and sophisticated patterns. In addition, the STD of RMSE between multiple predictions shows no significant variation before and after data augmentation, which indicates that the effects of data augmentation are similar for each step. Generally, the average RMSE of the augmented NN, LSTM, and BA models is 21.47%, 13.30%, and 7.60% higher, respectively, than that of the augmented EDLSTM.
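The "higher than EDLSTM" percentages quoted in this section and the "lowered by" percentages given in the conclusions describe the same gaps from opposite baselines. A short worked conversion, with the hypothetical helper name `lower_to_higher`, shows how the two relate: if EDLSTM's RMSE is a fraction r lower than a benchmark's, the benchmark's RMSE is r / (1 − r) higher than EDLSTM's. Small deviations from the reported values are expected, since averaging ratios over turbines and steps does not commute exactly with this conversion.

```python
def lower_to_higher(reduction):
    """Convert 'model is `reduction` lower than benchmark' into
    'benchmark is this fraction higher than model'."""
    return reduction / (1.0 - reduction)


# RMSE reductions of augmented EDLSTM versus augmented NN, LSTM, and BA,
# as stated in the conclusions (17.67%, 11.74%, 7.06%).
for name, r in (("NN", 0.1767), ("LSTM", 0.1174), ("BA", 0.0706)):
    print(f"{name}: {100 * lower_to_higher(r):.2f}% higher")
```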
To show the outcomes of the various data-augmented models more explicitly, the RMSE of each prediction step based on the eight augmentation approaches is averaged and plotted in Fig. 5.
Comparing Fig. 5 with Fig. 3 reveals three points. First, the tendency of gradually increasing RMSE persists in the data-augmented multistep predictions. Second, the augmented EDLSTM model outperforms its counterpart based on raw data at almost every prediction step for all wind turbines. Third, the power prediction of the T3 wind turbine is the best, with the RMSE of the data-augmented EDLSTM model just below 0.11, and the second-best is T1. Furthermore, the predictions for T2, T4, and T5, located in complex terrain, are also significantly improved. Thus, data augmentation improves EDLSTM for power forecasting, resulting in satisfactory reductions in model RMSE.

Competition between diverse data augmentation methodologies
The superiority of data augmentation approaches as a whole in wind power prediction is elaborated in Section 5.2. To further investigate which data augmentation approaches are more effective, the average and STD of RMSE for each prediction step by algorithms based on different augmentation approaches are computed and presented in Fig. 6. As can be seen, there is no obvious regularity in the average multistep forecasting performance of the different augmentation-based models. That is, the results of the various augmentation approaches show no consistent trend across forecasting algorithms. The overall RMSE of distinct augmentations is comparable in NN, LSTM, and BA, but not in EDLSTM. Nevertheless, certain patterns exist for augmentations in the prediction of different turbines. Regardless of the augmentation used, the prediction errors for turbines in flatter terrain are smaller, consistent with the predictions without augmentation.
As a further statistical examination of the variation among data augmentations in multistep predictions, the Friedman test is applied to determine whether there is a difference between the RMSE averages of the five wind turbines with different augmentations at the same time step. The p-values are listed in Table 5. Among the power forecasts based on data augmentations for all turbines, the effect of different augmentation approaches on the forecasting models is not statistically significant in most cases, such as in NN, LSTM, and most cases of BA. In particular, the RMSE of the proposed EDLSTM model, which has a relatively complex set of p-values, differs across augmentations only in the sixth- and seventh-step forecasts. Additionally, in view of EDLSTM's favorable performance in wind power forecasting, the decrease rate of the average multistep RMSE of each augmented model relative to the unaugmented model based on the same forecasting algorithm is computed. The rate is averaged over the five turbines and illustrated in Fig. 7. The p-value for the multiple comparison between these RMSE decrease rates is 0.00033, much less than 0.05, indicating that the overall improvements in EDLSTM performance with various augmentations are statistically different. In general, based on RMSE, PA3, PA2, DA1, and PA1 provide modest improvements to the EDLSTM model, from 7.87% to 9.96%, while DA5, DA4, DA3, and DA2 improve the model relatively substantially, in that order from 10.80% to 11.36%.
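The Friedman comparison across augmentations can be illustrated with a small sketch. It uses the textbook chi-square form of the statistic without tie correction, which is an assumption since the paper's own expression is not reproduced in this section; rows are taken as turbines and columns as augmentation approaches, and the RMSE table is hypothetical.

```python
def rank_row(values):
    """Rank values within one row (1 = smallest), averaging tied ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = average_rank
        i = j + 1
    return ranks


def friedman_statistic(table):
    """Friedman chi-square statistic (no tie correction) and degrees of freedom.

    `table` holds one row per block (turbine) and one column per
    treatment (augmentation approach).
    """
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(rank_row(row)):
            rank_sums[j] += r
    stat = 12.0 / (n * k * (k + 1)) * sum(s ** 2 for s in rank_sums) - 3.0 * n * (k + 1)
    return stat, k - 1


# Hypothetical RMSE of five turbines (rows) under three augmentations (columns).
rmse_table = [
    [0.110, 0.115, 0.120],
    [0.150, 0.152, 0.158],
    [0.105, 0.109, 0.111],
    [0.160, 0.166, 0.170],
    [0.140, 0.141, 0.149],
]
stat, dof = friedman_statistic(rmse_table)
# The 5% critical value of chi-square with 2 degrees of freedom is 5.991.
print(stat, dof, stat > 5.991)
```

Ranking within each row removes the turbine-to-turbine level differences (the row effects), so only the relative ordering of the augmentation approaches drives the statistic.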
Despite the varying degrees of RMSE decrease for the EDLSTM models with different augmentation approaches, the difference between some approaches, such as DA4 and DA5, is minimal. To further compare the effects of different augmentations, the average MAE and QR90 of forecasts under the same scenario as in Fig. 7 are obtained, and their change rates before and after augmentation are calculated and tested in Figs. 8 and 9. The p-value of the MAE decrease-rate comparison is 0.0023, less than 0.05, also smaller than its counterpart of RMSE, which likewise means that varying augmentations give statistically different boosts to EDLSTM. Similar to Fig. 7, the DAs are better than the PAs, but Fig. 8 offers a clearer distinction between several DAs. DA4 and DA5 have a greater MAE decline, 8.97% and 8.82%, than DA2 and DA3, 8.49% and 7.79%, which generally indicates that the former two provide predictions closer to the real values. However, DA4 and DA5 may have large deviations at some forecasting points, so these data-oriented augmentations are quite close in Fig. 7. The p-value of QR90 increase rate comparison is 0.0052, bigger than 0.05, which illustrates that different augmentations yield no significantly different improvements, around 12% to 13%, in QR90. This reveals that any of the augmentation techniques can raise the qualification rate of the EDLSTM model by a relatively similar amount and provide satisfactory forecasts in terms of this evaluation index.
To summarize, the impact of the different data augmentation methods on the benchmark models is not significantly different. However, the improvement for the deep EDLSTM varies slightly in RMSE and MAE, while the variation is unremarkable in the QR90 metric. DAs, on the whole, outperform PAs in RMSE and MAE, and MAE further reveals that DA4 and DA5 have an edge among the DA methods.

Conclusions
This paper scrutinizes, for the first time, the usefulness of data augmentation approaches in wind power forecasting and proposes a multi-input and multi-output prediction algorithm with verified superiority. Based on the results of multistep forecasting for five wind turbines with various topographies, the conclusions are given as follows.
The proposed seq2seq-based deep EDLSTM enables highly effective and robust multistep power forecasting for wind turbines under different terrain conditions by highlighting the sequential dependence of the problem. Compared with the benchmark PR, NN, LSTM, and BA algorithms, its overall RMSE is lowered by 33.89%, 10.60%, 7.12%, and 4.27%, respectively.
Since EDLSTM is a complex deep learning model, realizing its strength requires so-called big data. It is demonstrated that five-fold expansions of the primary data with data augmentations statistically boost the wind power forecasting capabilities of the neural network-based NN, LSTM, BA, and EDLSTM. The boost is particularly evident in EDLSTM, where, on average, the RMSE of the data-augmented model is 10.2% smaller than that of its counterpart without data augmentation. This boost can be interpreted as follows: expanding the training set is equivalent to adding a regularization term to the loss function when training models, which can effectively avoid overfitting. Besides, due to the stochasticity involved in data augmentations, the models learned with these techniques are more robust. Moreover, the data-augmented EDLSTM widens its edge over the benchmarks PR, NN, LSTM, and BA with the same expanded inputs to 40.63%, 17.67%, 11.74%, and 7.06% decreases in RMSE, respectively, since the proposed EDLSTM learns deeper information of the wind data, like signal decompositions, through the mentioned augmentation techniques.
The impact of the eight data augmentation approaches employed, three physics-oriented and five data-oriented, on wind power prediction is sensitive to the forecasting algorithm. For the proposed well-performing EDLSTM, the various augmentations boost the forecasting qualification rate at the 90% threshold by a similar amount, over 12%. However, the augmentations improve the forecasting performance to slightly different degrees when evaluated by RMSE and MAE: averaged over steps and turbines, the improvement varies from approximately 7.87% to 11.36% in RMSE and 5.24% to 8.97% in MAE within one standard deviation, and data-oriented augmentations generally outperform physics-oriented ones.
Among data-oriented augmentations, the results illustrate that EDLSTM's forecasting RMSE is decreased about as much by simply adding noise, randomly perturbing, or shifting the data as by sophisticated statistical data decomposition and learning-based data generation; however, as per MAE, the latter two provide overall closer predictions to the real power.
Building on this paper, our future research will investigate more advanced data augmentation techniques and integrate them into the proposed model to conduct in-depth point and probabilistic predictions, and will attempt industrial applications with extensive comparisons against other forecasting models.
Additionally, the following policy recommendations may be drawn.
Drawing on state-of-the-art deep learning techniques and increasing computational capabilities, wind power forecasting and the data problems derived from it in energy fields should progressively move from traditional statistical and parameter-sensitive classical machine learning methods to deep learning approaches that can automatically identify complex patterns. Besides, sophisticated deep networks are particularly reliant on the amount of data. Motivated by this article, the limited data of wind parks or other energy sectors could be artificially enlarged by appropriate data augmentations to serve as a stepping stone for further applications of deep learning to related scientific and engineering challenges.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.