Wind Power Forecasting Methods Based on Deep Learning: A Survey

Abstract: Accurate wind power forecasting for wind farms can effectively reduce the impact on grid operation safety when high-penetration intermittent power sources are connected to the power grid. To provide reference strategies for researchers and practitioners, this paper surveys and analyzes deep learning, reinforcement learning and transfer learning methods for wind speed and wind power forecasting. Wind speed and wind power forecasting around a wind farm usually requires computing the state at the next time step, which is typically derived from the state of the atmosphere, including the nearby atmospheric pressure, temperature, surface roughness and obstacles. As an effective method for high-dimensional feature extraction, a deep neural network can in theory represent arbitrary nonlinear transformations through proper structural design, for example by adding noise to the outputs, using evolutionary learning to optimize the hidden-layer weights, or optimizing the objective function, so that information that improves output accuracy is retained while irrelevant or weakly related information is filtered out. Establishing high-precision wind speed and wind power forecasting models remains a challenge because of the random, instantaneous and seasonal characteristics of wind.


Introduction
Wind power has been recognized as a promising clean alternative to fossil fuel-generated electricity. Because it is produced by capturing air flow in the atmosphere, wind power is by nature random and instantaneous in yield, and season-dependent in the short term. For the safe operation of wind turbines and the distribution of the electricity they generate, it is important to forecast wind speed accurately. Unfortunately, accurate wind speed forecasting, especially short-term wind speed forecasting (STWSF), remains quite a challenge [Wang, Zhang, Xu et al. (2018); Shao, Deng and Jiang (2018); Zhang, Sun, Sun et al. (2016)]. At present, wind power prediction models can be coarsely divided into three categories: physics-based prediction models, statistics-based prediction models and combined (hybrid) prediction models.
(i) Physics-based prediction models. These models take into account weather phenomena (or weather processes) and treat weather changes as non-random events that satisfy certain physical laws, such as energy conservation [Hodge and Milligan (2011); Shao, Deng and Cui (2016); Wang, Liu, Yu et al. (2019)]. Their fundamental assumption is that, at any given moment, the weather condition (state) is determined by past atmospheric data. That is, atmospheric change can be simulated through numerical weather prediction (NWP) based on the air pressure, temperature, contour, roughness and obstacles around the wind farm. Physics-based models can predict the desired variables directly from real-time data, which makes them preferred models for wind farms. However, because they require real-time, high-precision data, they place high demands on accurate data acquisition systems and high-speed data transmission networks. In addition, the modeling process tends to be complicated and sensitive to errors in the initial conditions.
(ii) Statistics-based prediction models. These models start from certain weather phenomena (or weather processes) and treat the evolving weather changes as a random process. Unlike physics-based models, statistics-based models may predict different weather outcomes in different runs for the same ambient conditions, because they rely on calculating the probability of occurrence of certain weather situations. As an improvement, such models sometimes include variables with physical meaning, which helps to interpret what is going on with the weather. Although quite popular among researchers and wind farm operators, statistics-based models place too much weight on historical data and still lack a physical basis for their predictions [Wu, Zhu, Su et al. (2015); Wang, Ji, Xue et al. (2016)]. Another drawback is that these models fail to consider the time delay in the forecasting process, which makes it difficult to integrate the wind power time series into the forecast model.
(iii) Combined prediction models. These models are weighted combinations of different prediction models and hold the promise of avoiding the weaknesses exhibited by any single model in wind power forecasting. One popular combined model uses the wavelet transform to reduce the negative effects of non-stationary and non-smooth wind power time series, after which neural network models are applied to improve prediction accuracy. In a nutshell, a combined model can take into account the various features of wind power time series, which are multi-time-scale and multi-resolution in nature [Hong, Li, Geng et al. (2019); Pant, Han and Wang (2019); Zhang and Hong (2019)].
Combined prediction models benefit from the rapid development of artificial intelligence, including deep learning and reinforcement learning, which helps to improve classification accuracy and to reveal the complex nonlinear relationships in wind power forecasting. The advantages of deep learning in feature selection and function approximation over traditional feedback neural networks underpin its superior performance in areas such as pattern recognition and computer vision. In particular, transfer learning and reinforcement learning can be combined with deep learning to improve feature analysis and computational efficiency. This paper therefore surveys how deep learning, reinforcement learning and transfer learning are applied in wind speed and wind power forecasting, especially in wind power data processing, input feature selection, forecasting model framework establishment and model structure optimization. Our goal is to assess various network structures, including feed-forward networks and Recurrent Neural Networks (RNNs), among a few others, and to determine their strengths and applicability in the context of the wind modeling methods described above. The rest of this paper is organized as follows. In Section 2, we introduce the fundamental forecasting framework, consisting of data preprocessing and feature analysis and selection. Section 3 compares studies reported in the literature on deep learning and reinforcement learning techniques, after which model structure optimization strategies based on deep learning are surveyed in Section 4. Finally, Section 5 concludes the paper with a discussion of future research.

Forecasting frameworks
The general wind speed forecasting framework mainly includes five components: data preprocessing, high-dimensional and low-dimensional data feature analysis, model formalization based on deep learning and reinforcement learning, model structure optimization, and model performance evaluation. These components are discussed in this section, and the general forecasting framework is given in Fig. 1.

Data preprocessing
Normally, the data collected from a wind farm contain many outliers, noisy samples and missing values caused by unavoidable factors. Because the data are continuous in time and the true information is masked by noise, such samples should be identified correctly by appropriate preprocessing algorithms. Outlier detection, noise elimination and missing value filling are the three most common and important aspects of data preprocessing. Inconsistent data (the same input recorded with different outputs, sometimes caused by human recording errors), irrelevant data and local extrema all cause deviations in the forecasting model. In particular, sparse data may be under-represented in the distribution. A method commonly used in engineering to handle local extrema is to rescale the data to a fixed range while maintaining their structure, for example by discretization and normalization, which also removes units so that variables become dimensionless and can be weighted. The model is then built on the normalized data, typically under the assumption that the data are normally distributed, and the actual values are recovered by applying the inverse operation. Missing values often appear in the data collected from a wind farm and destroy the temporal continuity of the samples. Interpolation based on the available variables around the missing value and their relationships is the most widely used way to fill missing values in wind farm data. Noisy data are one of the most common problems in wind farms. The most commonly used remedy is filtering, which removes the noise and yields an estimate of the actual signal. An ideal filtering technique removes corrupted data and effectively cleans up noise while minimizing signal distortion. Current wind farm data filtering strategies can be divided into three categories:
(i) Time-domain filtering.
The median or mean of the measurement data is computed within a window of predetermined size, and a threshold is then set within the window to determine whether the current value is reasonable. A more robust value is calculated and used to replace each unreasonable one. Since this method works with absolute values, the corresponding estimation interval with upper and lower bounds can be obtained directly. If the data within a certain time interval can be fitted by a normal distribution, the corresponding confidence interval estimate can also be calculated. The median filter, one of the most widely used time-domain filters for wind speed preprocessing, is shown in Fig. 2.
(ii) Frequency-domain filtering. The data are decomposed into low-frequency and high-frequency components by applying the Fourier transform; the high-frequency component is then filtered out so that the data are smoothed. A main drawback of the Fourier transform is that it does not allow time-domain and frequency-domain analysis to be performed simultaneously. Such filtering can be obtained by multiplying the signal in the frequency domain by a piecewise function or, equivalently, by convolving it with the sinc function in the time domain.
Figure 1: The general wind power forecasting framework
(iii) Time-frequency domain filtering. The measured data are transformed in the time and frequency domains simultaneously to obtain effective signal features over a wide range. Data filtering rests on the premise that each sample is sufficiently described by its neighborhood, so that filtering causes little or no loss of useful information. Wavelet analysis, the most widely used of these methods, is an ideal local time-frequency analysis method whose time-domain and frequency-domain windows can both be varied. For the low-frequency component of the signal it provides high frequency resolution and low time resolution, whereas for the high-frequency component it provides high time resolution and low frequency resolution. At large scales the low-frequency global information of the signal is represented, and at small scales the high-frequency local features are represented. Filtering with wavelet transforms is one of the most representative time-frequency methods for processing noisy data and is shown in Fig. 3. In addition, adaptive filtering methods such as empirical mode decomposition (EMD) [Wang, Zhang, Li et al. (2014); An, Jiang, Li et al. (2011); Feng, Liang, Zhang et al. (2012); Yeh, Shieh and Huang (2010); Wang, Zhang, Wu et al. (2016)] can be applied. Unlike the linear, steady-state spectral analysis of the Fourier transform, EMD decomposes a signal over various time scales without any predefined basis function, which reduces the effect of the non-smoothness of wind power data on forecasting modeling.
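As a rough illustration of the preprocessing steps discussed in this subsection, the following sketch fills missing values by linear interpolation and then applies a windowed median filter with a threshold, in the spirit of the time-domain filtering of item (i). It is a minimal sketch under stated assumptions, not a method taken from the surveyed works: the function names `fill_missing` and `median_filter`, the window size, the threshold and the sample values are all hypothetical.

```python
from statistics import median


def fill_missing(series):
    """Fill None entries by linear interpolation between valid neighbours."""
    valid = [i for i, v in enumerate(series) if v is not None]
    if not valid:
        raise ValueError("series has no valid samples")
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:        # gap at the start: copy the first valid value
            filled[i] = series[right]
        elif right is None:     # gap at the end: copy the last valid value
            filled[i] = series[left]
        else:                   # interior gap: interpolate linearly
            frac = (i - left) / (right - left)
            filled[i] = series[left] + frac * (series[right] - series[left])
    return filled


def median_filter(series, window=5, threshold=3.0):
    """Replace samples that deviate from the local median by more than threshold."""
    half = window // 2
    cleaned = []
    for i, value in enumerate(series):
        local = series[max(0, i - half): i + half + 1]  # window is truncated at the edges
        m = median(local)
        cleaned.append(m if abs(value - m) > threshold else value)
    return cleaned


# Hypothetical wind speeds in m/s with one missing sample and one spike.
speeds = [5.0, 5.1, 4.9, 30.0, 5.2, None, 4.8, 5.0]
filled = fill_missing(speeds)
cleaned = median_filter(filled, window=5, threshold=3.0)
```

Here the spike of 30.0 is replaced by the median of its window, while samples within the threshold are left untouched. The threshold is an absolute value in the units of the data; a threshold derived from a fitted normal distribution, as mentioned in the text, would be an alternative design.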

Feature analysis and selection
The structural design of a nonlinear multi-input learning system is always a challenge because of the difficulty of feature selection, which is of fundamental importance. Defining a set of optimized input variables or features should first take into account the effects of the input variables on the forecasting model. Some useful and widely used criteria are given as follows.
(i) Correlation analysis. Operators and engineers are usually concerned with whether the number of variables is too small or whether the variables carry insufficient information for forecasting. Correlation analysis makes it possible to judge intuitively whether a candidate variable has reference value for predicting the output variable. In addition, making full use of prior knowledge can improve both the computational efficiency and the prediction accuracy.
(ii) Computational efficiency. An increase in the number of input variables increases the size of the network and decreases its learning efficiency. Take the kernel-based Generalized Regression Neural Network and the feedforward neural network as examples: the metrics they use are highly sensitive to high-dimensional variables, which makes them computationally expensive.
(iii) Network learning efficiency. Training becomes difficult when the input of the neural network contains redundant or irrelevant variables, because these increase the number of local optima of the error function. Assume that the first term on the right side of (1) is the error between the actual value and the forecasted value, while the second term is the error between the actual value and the forecasted value containing the redundant variable. The estimated error is then greater than the real one. The model's forecasting error is given in (1), where $y_i(t)$ and $\hat{y}_i(t)$ are, respectively, the actual variables and their forecasts. The error determines the gradient direction of the learning process in a neural network, so it has a significant influence on the learning process of heuristic algorithms. Taking the widely used gradient-descent-based backpropagation algorithm as an example, the network is sensitive to the data distribution and easily converges to a local extremum. In addition, the convergence of the network is slowed by factors such as redundant model parameters and noise. In particular, useless information may be used to adjust the network weights and thus affect other forecasts. As a result, many iterative algorithms require a near-global-optimum error as a constraint to supervise the learning algorithm.
(iv) Sample dimensionality. The curse of dimensionality is a problem often encountered in wind power forecasting modeling. Although the dimension of the model appears to increase only linearly, the amount of computation increases exponentially. To estimate the model parameters precisely, the number of samples must also increase exponentially. In real applications, the established mapping may not fully reflect the trend of the actual data because the samples are limited. In the case of the Multilayer Perceptron, the curse of dimensionality causes the number of required samples and of network connection weights to increase rapidly, which in turn increases the computation time and reduces practicability. The collected wind power time series at different time slots can be treated as the original features of the forecast model, provided that their large dimensionality and high degree of redundancy are properly processed. Appropriate feature selection and analysis can be of great benefit to high-precision, high-efficiency wind power forecasting modeling. Taking the most widely used wavelet analysis as an example, the wavelet coefficients of the samples in the time-frequency domain accurately reflect the distribution of their features, in particular whether the training samples provide effective information for forecasting the test samples with high precision. Since wind speed can be regarded as a cumulative superposition of different frequency components with volatility and periodicity, the wavelet transform, which builds on short-time Fourier analysis, is the most widely used approach for this analysis.
Figure 4: Seasonal pattern analysis [Shao, Wei, Deng et al. (2016)]
Multi-layer decomposition of wind speed through multi-scale, multi-resolution wavelet analysis can be used to find components of the wind speed with similar frequency characteristics, after which forecasting models can be established that reflect the characteristics of each frequency component. Seasonal pattern analysis [Shao, Wei, Deng et al. (2016)] based on wavelet analysis is shown in Fig. 4; it is mainly used to ensure that the division of the sample space into subsets is reasonable.
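To make the idea of multi-layer wavelet decomposition concrete, the sketch below splits a short wind speed series into one low-frequency approximation and several high-frequency detail layers, and then reconstructs the original series. The Haar wavelet is chosen here only because it is the simplest to write down; it is an assumption of this sketch, not the mother wavelet used in the surveyed works, and the function names and sample values are hypothetical.

```python
import math

_S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_step(signal):
    """One decomposition level: pairwise averages (approx) and differences (detail)."""
    approx = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return approx, detail


def haar_decompose(signal, levels):
    """Return (approximation, details), with details ordered from finest to coarsest."""
    approx, details = list(signal), []
    for _ in range(levels):
        if len(approx) < 2 or len(approx) % 2:
            raise ValueError("signal length must be divisible by 2**levels")
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


def haar_reconstruct(approx, details):
    """Invert haar_decompose by undoing the levels from coarsest to finest."""
    current = list(approx)
    for detail in reversed(details):
        rebuilt = []
        for a, d in zip(current, detail):
            rebuilt.extend([(a + d) * _S, (a - d) * _S])
        current = rebuilt
    return current


# Hypothetical wind speed samples in m/s (length 8 allows up to three levels).
wind = [6.1, 6.3, 5.9, 6.0, 7.2, 7.4, 7.1, 6.8]
approx, details = haar_decompose(wind, levels=2)
restored = haar_reconstruct(approx, details)
```

In a forecasting pipeline of the kind described above, each layer (the approximation and every detail series) would be modeled separately and the forecasts recombined; that part is omitted here.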

Deep learning and reinforcement learning
Deep learning models with suitable architectures are designed to help select the input features [Dalto, Matuško and Vašak (2015); Chang, Lu, Chang et al. (2017)], and two types of forecasting models, deterministic and probabilistic, have been established. However, without proper feature analysis, and in particular a feature selection process tailored to wind power forecasting, these methods usually require a large amount of computation when processing high-dimensional, multi-input wind power data, which has a negative impact on scalability. An ensemble mixture density network (MDN) with a three-layer architecture [Men, Yee, Lien et al. (2016)] was proposed for wind power forecasting, and probabilistic forecasting measures were applied to verify its performance on data from an operational wind turbine in a wind farm in Taiwan (72 h forecast horizon). The experimental evaluation demonstrates that the proposed methodology achieves better performance in multi-step-ahead wind power forecasting. Different metrics in the learning algorithm are of great benefit to the acquisition of data features [Liu, Kwong and Chan (2012); Hossain, Rekabdar, Louis et al. (2015); Díaz, Torres and Dorronsoro (2015); Yan, Liu, Han et al. (2015); Luo, Sun, Wang et al. (2018)]. For instance, numerical simulations of 1 h ahead forecasting at seven different locations of the surface radiation network indicate that wind speed forecasting accuracy can be improved by about 30% compared with traditional benchmarks.
For simplicity, the forecasting methods based on deep learning and reinforcement learning are classified into the following categories.
(i) RNN or LSTM-related approaches. RNN and LSTM process a time sequence step by step, which means that before information enters the current processing unit, the long-term information of the input sequence must first traverse all preceding hidden units in order. When the gradients associated with a hidden layer become very small and vanish, improved variants such as the GRU (gated recurrent unit) [Fu, Hu, Tang et al. (2018)] help to alleviate the problem, in particular for longer-term sequence information. However, the sequence length that can be processed is still limited (about 100 steps or so); in other words, acquiring long-term information from longer sequences remains difficult. A novel deep learning approach based on infinite feature selection (Inf-FS) with recurrent neural networks (RNNs) is adopted in Shao et al. [Shao, Deng and Jiang (2018)]. The possible features related to the wind forecast are first clustered into multiple feature sets, after which the identified feature sets are mapped onto the paths of a graph built for Inf-FS. Traversing this graph helps to determine the significance of the features according to their stability and classification accuracy. Based on data from the National Renewable Energy Laboratory (NREL), the experimental evaluation indicated that the short-term wind power forecasting accuracy is improved by 11%, 29%, 33% and 19% in spring, summer, autumn and winter, respectively. One shortcoming of the RNN is its vanishing gradient, which is caused by repeated multiplication by the recurrent weight matrix, that is, by a high power of that matrix. As a result, the long-term dependency of wind power time series cannot be established.
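The role of the high matrix power can be illustrated numerically. The sketch below considers a linearized recurrence, ignoring the activation function (an assumption made only to keep the example short), in which a gradient propagated back over k time steps is scaled by the k-th power of the recurrent weight matrix. The two 2x2 matrices are hypothetical; since backpropagation uses the transpose, which has the same Frobenius norm, the norm of the plain power is shown.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def frobenius_norm(m):
    """Square root of the sum of squared entries."""
    return sum(x * x for row in m for x in row) ** 0.5


def gradient_scale(w, steps):
    """Frobenius norm of w**steps, the factor applied to a gradient sent back `steps` steps."""
    n = len(w)
    product = [[float(i == j) for j in range(n)] for i in range(n)]  # identity matrix
    for _ in range(steps):
        product = matmul(product, w)
    return frobenius_norm(product)


W_SMALL = [[0.5, 0.1], [0.0, 0.6]]  # eigenvalues 0.5 and 0.6: gradients vanish
W_LARGE = [[1.2, 0.1], [0.0, 1.1]]  # eigenvalues 1.2 and 1.1: gradients explode

for k in (1, 10, 50):
    print(k, gradient_scale(W_SMALL, k), gradient_scale(W_LARGE, k))
```

When all eigenvalues lie inside the unit circle the factor decays geometrically with the number of steps, so early inputs have almost no influence on the weight update; when an eigenvalue lies outside it, the factor grows geometrically instead.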
The Long Short-Term Memory (LSTM) network is one of the most widely used RNN models; it addresses the problems experienced with simple RNNs [Wu and Lundstedt (1996); Senjyu, Yona, Urasaki et al. (2006); Barbounis, Theocharis, Alexiadis et al. (2006); Theocharis (2006, 2007); Aquino, Carvalho, Neto et al. (2010); Liu, Wu, Wang et al. (2018)], including gradient explosion and gradient vanishing. To analyze the mixture of long-term and short-term patterns in time series effectively, LSTM, CNNs and RNNs were combined to extract the local dependency patterns of the inputs and to discover the intrinsic patterns that describe the general trends [Lai, Chang, Yang et al. (2018)]. Experiments indicated that proper analysis of the repetitive patterns of time series helps to improve the final forecasting accuracy. An LSTM network with a single hidden layer is shown in Fig. 5. An LSTM network in combination with the wavelet transform and an Elman neural network is adopted in Wu et al. [Wu, Chen, Qiao et al. (2016) (2019)] for multi-step wind speed forecasting: the low-frequency and high-frequency components obtained via the wavelet transform are forecasted by the LSTM network and the Elman neural network, respectively, with the aim of improving the overall forecasting accuracy. An innovative short-term wind power forecasting system based on ensemble methods and transfer learning was developed to construct an improved deep learning model [Qureshi, Khan, Zameer et al. (2017)].
(ii) ELM or SELM-related approaches. A novel bidirectional mechanism and a backward forecasting model based on the extreme learning machine (ELM) were proposed to address ultra-short-term wind power time series forecasting [Mohammadi, Shamshirband, Yee et al. (2015); Zhao, Ye, ]. Assuming first that the previous measurements are unknown, the reversed wind power time series is generated to train the backward ELM with an optimization algorithm.
The forecasting errors between the forward and backward results are analyzed to adjust the model structure, optimize the model learning algorithm and, finally, improve the forecasting accuracy. The experimental evaluation and comprehensive error analysis of 1-6 h ahead forecasting indicated that the bidirectional mechanism helps to improve wind power forecasting accuracy. The stacked ELM (SELM), which couples ELM with a deep learning framework and has lower memory consumption, has been compared with the traditional ELM; combined with correntropy, which measures the similarity of the inputs, it is used for ultra-short-term wind speed forecasting. The computing performance indicated that the established model has better forecasting accuracy than other traditional methods. The uncertainty and instability of wind power time series always pose a challenge for high-precision forecasting [Tao, Chen and Qiu (2014); Paterakis, Mocanu, Gibescu et al. (2017)]. Unlike the traditional feedforward neural network, ELM, as a single-hidden-layer feedforward neural network, only requires the number of hidden nodes to be set; its input weights and hidden-layer biases are assigned randomly and need not be tuned, and the output weights are obtained as the unique least-squares solution, which gives fast learning speed and good generalization ability. However, ELM is essentially a simple network based on the minimum-norm least-squares solution and cannot exploit the gradient-based chain rule used in DNNs. This makes it less suited to extracting highly discriminative features from wind power data.
(iii) CNNs or DBNs-related approaches. Wavelet analysis, the most widely used time-frequency analysis method, can be used to approximate the original time series by components with similar frequency characteristics. This helps to reduce the impact of non-stationarity and to improve the accuracy of predictive modeling.
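Returning briefly to the ELM of item (ii), the training scheme of the standard formulation can be sketched in a few lines: the input weights and biases of the single hidden layer are drawn at random and kept fixed, and only the output weights are computed, in closed form, by least squares. The sketch below is a minimal illustration with a scalar input. The class name, the weight ranges, the toy sine target and the small ridge term (added for numerical stability in place of a pseudo-inverse) are assumptions of this example, not details taken from the surveyed works.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


class ELM:
    """Single-hidden-layer ELM for scalar inputs (illustrative only)."""

    def __init__(self, n_hidden, ridge=1e-6, seed=0):
        rng = random.Random(seed)
        # Input weights and biases are random and never trained.
        self.w = [rng.uniform(-3.0, 3.0) for _ in range(n_hidden)]
        self.b = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
        self.ridge = ridge
        self.beta = [0.0] * n_hidden  # output weights, set by fit()

    def hidden(self, x):
        return [math.tanh(w * x + b) for w, b in zip(self.w, self.b)]

    def fit(self, xs, ys):
        h = [self.hidden(x) for x in xs]
        n = len(self.w)
        # Regularized normal equations: (H^T H + ridge * I) beta = H^T y
        gram = [[sum(row[i] * row[j] for row in h) + (self.ridge if i == j else 0.0)
                 for j in range(n)] for i in range(n)]
        rhs = [sum(row[i] * y for row, y in zip(h, ys)) for i in range(n)]
        self.beta = solve_linear(gram, rhs)
        return self

    def predict(self, x):
        return sum(bt * hv for bt, hv in zip(self.beta, self.hidden(x)))


# Toy regression target standing in for a wind-related signal.
xs = [-1.0 + 2.0 * i / 39 for i in range(40)]
ys = [math.sin(3.0 * x) for x in xs]
model = ELM(n_hidden=10).fit(xs, ys)
mse = sum((model.predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
baseline = sum(y * y for y in ys) / len(ys)  # error of always predicting zero
```

Because no gradient is propagated through the hidden layer, training reduces to one linear solve, which is the source of the fast learning speed noted above and also of the limitation that the hidden features themselves are never adapted to the data.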
Sehnke et al. proposed that fitting a function to the wind speed frequency distribution can be used to reduce the uncertainties of the wind speed distribution [Sehnke, Strunk, Felder et al. (2013)], an approach that reflects the state of the art in industrial aerodynamics for wind resource assessment. Probabilistic wind power forecasting with a deep learning ensemble approach is given in Wang et al. [Wang, Li, Wang et al. (2017)], where the wavelet transform performs the time-frequency analysis and convolutional neural networks (CNNs) extract highly discriminative features so that the data are estimated accurately. The competitive performance of the proposed approach shows that the uncertainties in wind power data can be learned accurately. Deep belief networks (DBNs) are utilized to forecast wind power in Wang et al. [Wang, Wang, Li et al. (2016)]: the wavelet transform, spline quantile regression (QR) and DBNs are used, respectively, to analyze the non-stationary trend, to synthesize the results statistically and to extract the nonlinear features of the wind power data. The high stability and performance of the proposed approaches were demonstrated in the experimental evaluation. Similarly, combined with variational mode decomposition (VMD), LSTM networks and ELM are used to forecast the low-frequency and high-frequency components of the raw data, respectively. Four experiments showed that the proposed hybrid approach achieves the best forecasting accuracy compared with eight traditional models and helps to extract effective and robust features as well as trend information. As in the previous works, empirical mode decomposition (EMD) [Hu, Wang and Zeng (2013)], wavelet transforms and multi-scale decomposition methods have been developed to reduce the negative influence of non-stationary wind power time series in the short term.
A data-driven multi-model methodology for wind speed forecasting is presented based on ensemble neural networks.
(iv) Hybrid NNs-related approaches. A hybrid model for short-term wind speed forecasting is presented based on Wavelet Packet Decomposition, a Convolutional Neural Network and a Convolutional Long Short-Term Memory Network (WPD-CNNLSTM-CNN) [Liu, Mi and Li (2018a)]. To reduce the negative impact of the non-stationarity of the raw data on short-term forecasting, WPD is used to decompose the original wind speed time series into several sub-layers at different levels. The CNN and the CNNLSTM are used to model the high-frequency and low-frequency sub-layers of the time series, respectively, and to establish the forecasting model. The experimental evaluation shows that modeling different frequencies with different networks effectively improves short-term wind speed forecasting accuracy on the testing samples. Similarly, to overcome the non-stationarity of wind power time series and analyze the inherent volatility of wind speed, conditional mutual information is used for feature selection, and the wavelet packet technique combined with several individual models that use different mixtures of mother wavelets is used to construct the ensemble model [Li, Wang and Goel (2015) 2018)], so as to improve the accuracy of short-term wind speed modeling. Wang et al. [Wang and Li (2018)] proposed an innovative hybrid approach for multi-step-ahead wind speed forecasting that combines the Kullback-Leibler divergence, an energy measure and sample entropy to achieve optimal feature selection and improve computational efficiency. DNNs are then used to capture the long-term and short-term memory characteristics of the data and to establish the forecasting model. The forecasting error is analyzed with a generalized autoregressive conditionally heteroscedastic (GARCH) model to update the evolutionary information in modeling.
A wind force model for wind disturbances was developed to analyze the weather conditions; its simulation results indicate that the wind-induced vibrations and pointing errors of NASA Deep Space Network (DSN) antennas show a high degree of consistency owing to the steady-state wind pressure [Gawronski, Bienkiewicz and Hill (1994)]. Phase Space Reconstruction (PSR) and Kernel Principal Component Analysis (KPCA) are adopted in succession to select the input vectors dynamically and to extract the nonlinear features of the original high-dimensional feature space reconstructed by PSR. Finally, Grey Relational Analysis and the Pesaran-Timmermann statistic, among others, are applied to assess the forecasting effectiveness of the proposed approaches. A comprehensive study of wind speed forecasting over horizons from a few seconds to several months, based on the adaptive neuro-fuzzy inference system (ANFIS) and neural networks, is given in Okumus et al. [Okumus and Dinler (2016); Chu, Yuan, Wang et al. (2019)]. The experiments indicate that the mean absolute percentage errors (MAPE) of the proposed approaches are 2.2598%, 3.3530% and 3.8589% for daily average wind speeds at three different locations. To make full use of the higher-order correlations of the variables, DNNs are designed to meet the requirements of measure-correlate-predict (MCP) for wind resource assessment.
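For reference, the MAPE quoted above is commonly defined as the average of the absolute errors relative to the actual values, expressed in percent. The short sketch below computes it under that usual definition; the function name and the sample numbers are hypothetical, and the surveyed study may differ in details such as how near-zero wind speeds are handled.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    if not actual or len(actual) != len(forecast):
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Hypothetical daily average wind speeds (m/s) and their forecasts.
observed = [8.0, 10.0, 12.5]
predicted = [8.4, 9.5, 12.5]
error_percent = mape(observed, predicted)
```

Because each error is divided by the actual value, MAPE becomes unstable for calm periods with wind speeds close to zero, which is one reason it is usually reported for averaged quantities such as daily means.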
(v) Q-learning-related approaches. Deep learning has strong perceptual ability but insufficient decision-making ability, whereas reinforcement learning has decision-making ability but lacks perceptual ability. An effective combination of the two can therefore achieve complementary advantages and enhance the generalization ability of the learning algorithm. Wei et al. [Wei, Zhang, Qiao et al. (2015)] proposed an intelligent maximum power point tracking (MPPT) algorithm based on reinforcement learning (RL). The Q-learning method is applied in the controller of the wind energy conversion system to learn the mapping between system states and control actions online. The optimum wind-energy curve for the maximum power points (MPPs) is then generated to control the wind energy conversion systems (WECSs) [Wei, Zhang, Qiao et al. (2015); Vandael, Claessens, Ernst et al. (2015); Wei, Zhang, Qiao et al. (2016); Wang, Zhang, Long et al. (2017)]. The model architecture and parameters established in the pre-learning process effectively promote the interaction between the wind energy conversion system and its environment and thereby improve the accuracy of predictive modeling. Salehizadeh et al. [Salehizadeh and Soltaniyan (2016)] proposed a fuzzy Q-learning approach for hour-ahead forecasting modeling of renewable resources. The experimental evaluation on the IEEE 30-bus test system indicated that the proposed approach can model inputs with continuous multi-dimensional variables and improves computational efficiency. Zhang et al. indicated that deep reinforcement learning (DRL), a combination of deep learning and reinforcement learning [Zhang, Han and Deng (2018)], is one of the most representative methods for reducing the cost of computing power and capturing the intrinsic patterns of power systems, and that it has been widely used in smart grids and power system coordination control.
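The idea of learning a mapping from system states to control actions with Q-learning can be illustrated on a toy problem. In the sketch below, the state is a discretized rotor speed level, the actions lower, hold or raise that level, and the reward is the power obtained at the new level. Everything here is a hypothetical stand-in chosen for brevity (the concave power curve, the number of states, the learning parameters and the function names); it is not the controller of Wei et al.

```python
import random

N_STATES = 9          # discretized rotor speed levels (hypothetical)
OPTIMUM = 5           # level at which the toy power curve peaks (hypothetical)
ACTIONS = (-1, 0, 1)  # lower, hold, raise the speed level


def power(state):
    """Toy concave power curve with its maximum at OPTIMUM (not a turbine model)."""
    return 1.0 - ((state - OPTIMUM) / 5.0) ** 2


def step(state, action):
    """Apply an action, clip to the valid range, and reward the resulting power."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, power(nxt)


def greedy_action(q, state):
    """Index of the action with the largest Q-value in this state."""
    return max(range(len(ACTIONS)), key=lambda i: q[state][i])


def train(episodes=2000, horizon=30, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0] * len(ACTIONS) for _ in range(N_STATES)]
    for _ in range(episodes):
        s = rng.randrange(N_STATES)  # random start improves state coverage
        for _ in range(horizon):
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = greedy_action(q, s)
            s2, r = step(s, ACTIONS[a])
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q


def greedy_rollout(q, start, steps=20):
    """Follow the learned greedy policy and return the visited states."""
    s, path = start, [start]
    for _ in range(steps):
        s, _reward = step(s, ACTIONS[greedy_action(q, s)])
        path.append(s)
    return path


q_table = train()
path_from_low = greedy_rollout(q_table, start=0)
path_from_high = greedy_rollout(q_table, start=N_STATES - 1)
```

After training, the greedy policy is expected to move the speed level toward the peak of the power curve and then hold it there, which mirrors the role of the learned state-action mapping in MPPT. A real controller would work with measured, continuous quantities and an unknown power curve, which is where function approximation and the deep variants discussed in this item come in.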
Xiao et al. [Xiao, Xiao, Dai et al. (2018)] proposed an energy trading scheme for microgrids (MGs) based on the deep Q-network (DQN) to improve the utility of MGs, in particular when a large number of MGs are connected. For MGs with wind generation, experiments indicated that the proposed scheme significantly reduces energy consumption and improves the utility of the MGs compared with the traditional strategy. Hu et al. [Hu and Chen (2018)] proposed a hybrid scheme for wind speed forecasting based on LSTM, ELM and an evolution algorithm. The hysteresis of biological neural networks is added to the activation function to improve the generalization ability of ELM, and evolution algorithms are used to minimize the weighted objective function and derive the optimal number of hidden layers of the LSTM. Experiments based on four performance indices and statistical tests showed that, compared with the forecasting results of classical models, the ensemble outputs of the proposed hybrid approach are superior in predictive accuracy. The stochastic and instantaneous characteristics of wind speed hinder the extraction of spatio-temporal features from the related variables of neighboring wind farms. Khodayar et al. [Khodayar and Wang (2018)] proposed a graph deep learning model to analyze an undirected graph of wind sites; it consists of an LSTM and a graph convolutional deep learning architecture (GCDLA), which respectively capture the temporal features of the inputs and carry out the forecasting. The experimental results show that hybrid deep learning methods built from different deep networks can effectively improve the prediction accuracy of short-term wind speed.
In addition, various improved versions of LSTM, for example [Xu and Xia (2018)], still have very high hardware requirements. Even when no online learning process is required, a lot of resources are still needed for the network to converge quickly; that is, the improved versions are still not well matched to hardware acceleration. Transfer learning helps to shorten the convergence time of the network and to adapt quickly to new testing samples. A collective, robust and stable decision output is derived on the testing data with this approach, which improves the generalization ability of deep learning. Because numerical weather prediction (NWP) fails to capture the spatial characteristics of the data sufficiently [Felder, Kaifel and Graves (2010)], and data-driven models are always less adaptable to new testing samples, a deep neural network in combination with the transfer learning framework [Hu, Zhang and Zhou (2016)] is used to extract a high-level representation of the raw data and to promote rapid convergence of the model configuration. A hybrid architecture based on the revised bat algorithm (BA) with the conjugate gradient (CG) method is designed for multi-step ahead wind speed forecasting [Xiao, Qian and Shao (2017)]. Through the optimization of the weight initialization of the hidden layers of deep neural networks (DNNs), singular spectrum analysis in combination with a general regression neural network achieved better forecasting accuracy than the other existing methods. Because of the uncertainty and randomness of the wind speed distribution, extracting highly discriminative features from the raw data is becoming more difficult. The predictive deep Boltzmann machine (PDBM) [Zhang, Chen, Gan et al. (2015)] is therefore used to build a long-term and short-term forecasting framework based on sophisticated deep-learning techniques, so that the probabilistic characteristics of wind speed are effectively captured by the deep neural network.

Model structure optimization
Model structure optimization for deep learning is a very challenging problem. Polynomial-time algorithms suit the convex optimization problems that arise in simple neural networks, because every local minimum of a convex problem is also a global minimum [Zadeh and Goel (2013)]; this does not hold for deeper multi-layer neural networks. A completely reliable optimization algorithm for deep networks is almost non-existent, because the model optimization problem is essentially NP-hard (non-deterministic polynomial-time hard). In other words, it is not certain whether the currently trained deep network is the best model. High-precision learning is basically harmless when the training samples are free of noise. However, even after data preprocessing it cannot be completely determined whether all noise has been removed from wind farm power data, so an excessively high learning accuracy forces the neural network to fit the noise contained in the samples, which results in overfitting of the forecasting model. In addition, due to the influence of the initial weights, the number of iterations a neural network needs to reach a given learning accuracy is somewhat random. In practice, the most commonly used gradient descent optimization method is able to obtain sufficiently good local extrema, and heuristic methods, regularization methods, feature selection methods, new hardware (such as GPUs) and running as many gradient descent iterations as the time budget allows are of great benefit to improving the generalization ability of deep learning. Moreover, statistical performance measures based on error analysis criteria, such as the root mean squared error (RMSE) and the mean absolute error (MAE), are widely used in comparisons with traditional techniques to verify the effectiveness of the proposed approaches in wind power forecasting. A DBN is used to establish the nonlinear relationship between the historical data and the variables to be forecasted.
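The two error criteria named above follow their standard definitions and can be sketched as follows; the function names and the small example series are illustrative assumptions rather than data from any cited study.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    absolute = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute) / len(absolute)


# Hypothetical measured and forecast wind speeds (m/s).
measured = [2.0, 4.0, 6.0]
forecast = [1.0, 4.0, 8.0]
example_rmse = rmse(measured, forecast)
example_mae = mae(measured, forecast)
print(example_rmse, example_mae)
```

RMSE penalizes large deviations more strongly than MAE, so reporting both gives a rough indication of whether the errors of a model are dominated by a few large misses.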
Deep learning, with its powerful ability to capture the different patterns of wind power time series, shows significant value in both scientific and engineering applications. In order to overcome the disadvantages of persistence statistical models caused by the randomness and uncontrollability of wind power time series, a deep neural network with three hidden layers in combination with stacked auto-encoders (SAE) is proposed in Jiao et al. [Jiao, Huang, Ma et al. (2018)]. Better forecasting accuracy is achieved with the designed network architecture optimized by particle swarm optimization (PSO). The multi-step ahead forecasting results showed that the forecasting accuracy of short-term wind power can be improved by 12%. A novel dynamical integrated approach is employed to forecast wind speed and assess the potential of wind energy [Sun, Qiao, Wei et al. (2017)]. The EnsemLSTM method [Chen, Zeng, Zhou et al. (2018)] uses LSTMs, Support Vector Regression (SVR) and the Extremal Optimization (EO) algorithm to establish a nonlinear relationship between the predictive output and the historical data through a nonlinear-learning ensemble of DNNs. A cluster algorithm is developed to deal with various problems caused by data diversity, such as the slow transmission of information in the hidden layers of a deep network and the inaccuracy of information expression. Experiments on data sampled every 10 minutes in Inner Mongolia, China showed that the proposed EnsemLSTM can overcome the unstable prediction accuracy and low generalization ability of a single model, and yields a forecasting model with high accuracy. Based on outputs consisting of wind speed and daily solar radiation derived from NWP [Díaz-Vico, Torres-Barrán, Omari et al. (2017)], DNNs are used to capture and explore the intrinsic features of the multi-dimensional inputs, and they achieve state-of-the-art forecasting results.

Experiments
Because architecture design, combination logic, application scenarios (such as coastal, offshore and inland wind farms), time scales and sampling frequencies usually differ between forecasting studies, this paper only compares some of the prediction method structures from the last three years. Based on the data derived from the sampling device (a weak wind turbine, named type-FD77) in a wind farm plant of East China, Shao et al. [Shao, Deng and Cui (2016)] proposed a forecasting architecture based on AdaBoost neural networks in combination with wavelet decomposition to forecast the wind speed in the short term. The sampling frequency is 5 minutes/point, and 3 variables are selected from 18 candidates, such as the average wind speed, average wind direction, real-time wind speed and real-time wind direction at 10 m, 50 m and 70 m, together with the 10 m air temperature, 10 m relative humidity and air pressure; 24-step (i.e., 2 h) ahead prediction is mainly considered. The average forecasting results of TRA (traditional approach without model structure selection), FNN (Fuzzy model neural network), MSS (Model structure selection), MSS-Wav (MSS in combination with Wavelet decomposition at level Db4(2)), MSS-Ada (Neural network ensemble in combination with MSS-Wav) and HybridNN (Hybrid neural network) are given in Tab. 1, where ET represents the elapsed time in seconds. The proposed approach uses a classical network structure, i.e., multiple input and single output (MISO), consisting of three layers: an input layer, a hidden layer and an output layer. The number of input-layer nodes is determined by the inputs chosen by the model variable selection techniques. Of course, the considered system is not actually static, so the number of inputs changes over time due to the seasonal nature of the wind speed. The average forecasting errors reported in Tab. 1 indicate that the proposed MSS-Ada approach reduces the errors by about 11.17%, 9.75%, 8.73% and 9.63% relative to TRA; 9.58%, 9.03%, 4.41% and 6.65% relative to FNN; 2.28%, 8.45%, 7.03% and 5.77% relative to MSS; 0.78%, 0.41%, 6.55% and 4.29% relative to MSS-Wav; and 7.19%, 5.30%, 3.66% and 0.98% relative to HybridNN. The experiments showed that the generalization ability of the model is further improved if the irrelevant variables are removed, the model order is appropriately estimated and the frequency characteristics of the wind speed are properly considered in the final modeling.
In Shao's later work [Shao, Wei, Deng et al. (2016)], the seasonal characteristics of the wind speed are considered more strictly and are reflected in the dataset division. More precisely, the sample subsets are no longer divided equally in the traditional way, but according to the seasonal characteristics of the wind speed distribution. This is more conducive to analyzing the characteristics of the training and testing samples, and to the mutual coverage of their spectra. For each divided subset of the data from the Yunnan wind farm, 60%, 20% and 20% are chosen as the training, verification and testing samples, respectively. The experimental results for the 12-step (i.e., 2 h) ahead prediction are given in Tab. 2. In Tab. 2, ET, TRA, TRD and MSN are respectively the computational time in seconds, the traditional approach, TRA with the seasonal pattern considered, and MSS without the neural network ensemble method. The sampling frequency is 10 minutes/point. The traditional approach is a Single Input Single Output (SISO) model; in other words, the random and seasonal nature of the wind speed distribution cannot be considered in the forecasting. In fact, the AdaBoost neural network, which combines the learning abilities of multiple individual models, can significantly promote the network architecture synthesis and thereby improve the generalization ability of the forecasting model.
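The season-wise 60%/20%/20% division described above can be sketched as follows. Assigning seasons by calendar month is a simplifying assumption made here for illustration (the cited work divides the data according to the seasonal characteristics of the wind speed distribution), and `SEASON_OF_MONTH` and `seasonal_split` are hypothetical names.

```python
# Assumed calendar mapping; a real study may derive seasons from the data.
SEASON_OF_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def seasonal_split(samples, train_pct=60, valid_pct=20):
    """Group (month, value) samples by season, then split each group in order.

    The remaining share after training and verification becomes the test set.
    """
    by_season = {}
    for month, value in samples:
        by_season.setdefault(SEASON_OF_MONTH[month], []).append(value)

    splits = {}
    for season, values in by_season.items():
        n_train = len(values) * train_pct // 100
        n_valid = len(values) * valid_pct // 100
        splits[season] = {
            "train": values[:n_train],
            "valid": values[n_train:n_train + n_valid],
            "test": values[n_train + n_valid:],
        }
    return splits


# Hypothetical data: ten readings for each month of one year.
samples = [(month, float(i)) for month in range(1, 13) for i in range(10)]
splits = seasonal_split(samples)
for season, parts in splits.items():
    print(season, {name: len(part) for name, part in parts.items()})
```

Splitting each seasonal subset in time order keeps the verification and testing samples later than the training samples within that subset, which avoids leaking future observations into training.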
The AdaBoost-based approach can be treated as a forecasting model with good robustness and high accuracy, and it has a range of engineering applications because it reduces the errors caused by artificially set parameters. In order to effectively reflect the seasonal characteristics of the wind speed in forecasting modeling, k = (6, 12, 18, 24) step (corresponding to 1, 2, 3 and 4 h) ahead forecasting of actual wind power is also carried out, and the averaged forecasting results obtained for each season, based on the data division derived from the seasonal pattern, are tabulated in Tab. 3. In Tab. 3, FS denotes the specified number of steps ahead; RMSE1, RSD1, RMSE2 and RSD2 are the RMSE (root mean square error) and RSD (relative standard deviation) of the LM (Levenberg-Marquardt) output and the BFGS (a quasi-Newton method) output in training; RMSE3, RSD3, RMSE4 and RSD4 are the RMSE and RSD of the LM output and the BFGS output in testing; and ET indicates the elapsed time in seconds. Although the high prediction accuracy of the proposed approach comes at a high price in elapsed time, it is still a robust approach for wind power forecasting. Usually, the forecasting performance deteriorates as the number of forecasting steps increases. In particular, a small error in wind speed forecasting usually causes a large error in wind power forecasting. In real applications, the forecasting model should therefore be capable of error correction, dynamical feedback and adaptive adjustment. The forecasting results obtained by the proposed method are compared with two other classical methods, i.e., the persistence model and the new reference model. Relative to the persistence model, the MAPE of the 1 h-ahead, 24 h-ahead and 48 h-ahead predictions is reduced by 93.74%, 82.04% and 82.46%, respectively. Relative to the new reference model, the MAPE is reduced by 94.10%, 77.51% and 77.92%, respectively.
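To make the comparison with the persistence model concrete, the following minimal sketch shows how a persistence baseline, the MAPE and a relative error reduction are commonly computed. The helper names and the numbers are illustrative assumptions and do not reproduce the reported experiments.

```python
def persistence_forecast(series, horizon):
    """Persistence baseline: the forecast for time t + horizon is the value at t.

    The returned list is aligned with the targets series[horizon:].
    """
    return series[:-horizon]


def mape(actual, predicted):
    """Mean absolute percentage error in percent (actual values must be non-zero)."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def relative_reduction(baseline_error, model_error):
    """Percentage by which the model error is lower than the baseline error."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Hypothetical wind power series (MW) and a 1-step-ahead comparison.
series = [10.0, 12.0, 11.0, 13.0, 12.0]
targets = series[1:]
baseline = persistence_forecast(series, horizon=1)
model_output = [11.5, 11.2, 12.6, 12.1]  # hypothetical model forecasts

baseline_mape = mape(targets, baseline)
model_mape = mape(targets, model_output)
print(baseline_mape, model_mape, relative_reduction(baseline_mape, model_mape))
```

Under this definition, a reported reduction of 93.74% would mean that the MAPE of the model is 6.26% of the MAPE of the persistence model at the same horizon, assuming the cited study uses the same relative measure.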
As mentioned before, with the powerful capability of the wavelet transformation in time-frequency domain analysis, the instantaneity, randomness and non-stationarity of the short-term wind speed distribution are greatly reduced, which is of great benefit to the improvement of prediction accuracy. As Zhang et al. [Zhang, Chen, Gan et al. (2015)] reported, the 1-h ahead forecasting results of five feature selection methods are given in Tab. 5. With CMIFS, the forecasting accuracy in terms of MAPE is improved by 15.5%, 21.0%, 19.7% and 67.5% relative to the other four methods, respectively. In fact, a forecast can never be 100% accurate. Forecasting model structure design methods need to take full account of the distribution characteristics of wind power samples, the seasonal characteristics and the dynamic adjustment of model capacity, especially for deep learning and online learning, in order to improve the forecasting accuracy and adapt to practical engineering applications. Similar to the work of Zhang et al. [Zhang, Chen, Gan et al. (2015)], a forecasting model combining the wavelet transformation with a deep learning approach is also used for wind speed forecasting by Wang et al. [Wang, Wang, Li et al. (2016)]. For the deterministic 1-h ahead forecast, the forecasting error in the four seasons and on average is reduced by 6.43%, 5.56%, 4.76%, 8.81% and 6.39%, respectively. This also indicates that a hybrid approach, which combines the advantages of different methods, can effectively improve the prediction accuracy of wind power.
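The wavelet decomposition step used by these hybrid models can be illustrated with a two-level discrete wavelet transform. For brevity, the sketch below uses the Haar wavelet written out by hand rather than the Db4 wavelet mentioned for MSS-Wav, and it assumes the series length is divisible by 2 raised to the number of levels; in practice a dedicated library such as PyWavelets would normally be used.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_step(signal):
    """One Haar level: pairwise scaled sums (approximation) and differences (detail)."""
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    approx = [(a + b) / SQRT2 for a, b in pairs]
    detail = [(a - b) / SQRT2 for a, b in pairs]
    return approx, detail


def haar_inverse_step(approx, detail):
    """Invert one Haar level, restoring the interleaved samples."""
    signal = []
    for a, d in zip(approx, detail):
        signal.extend(((a + d) / SQRT2, (a - d) / SQRT2))
    return signal


def haar_decompose(signal, levels=2):
    """Return the coarsest approximation and the details, finest level first."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


def haar_reconstruct(approx, details):
    """Rebuild the signal by inverting the levels from coarsest to finest."""
    signal = list(approx)
    for detail in reversed(details):
        signal = haar_inverse_step(signal, detail)
    return signal


# Hypothetical short wind speed series (m/s).
wind_speed = [5.1, 5.3, 6.0, 5.8, 4.9, 5.2, 6.4, 6.1]
approx, details = haar_decompose(wind_speed, levels=2)
restored = haar_reconstruct(approx, details)
print(approx, details)
```

In a hybrid scheme of the kind surveyed here, each sub-series (the approximation and every detail level) would typically be forecast by its own model and the forecasts recombined through the inverse transform.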

Conclusions
In this paper, wind power forecasting modeling based on deep learning was formulated and discussed. The fundamental forecasting frameworks of deep learning-related methods, consisting of multi-feature and single-feature selection, were given first. A literature comparison of the deep learning-related approaches, the advantages and disadvantages of the various methods, and their improvement strategies were then discussed in detail. Finally, the performance evaluation criteria and the model optimization strategy for deep learning-related approaches were provided. We believe that the strategies discussed in this paper will help researchers and engineers to understand and apply architecture design, model structure selection and optimization in wind power forecasting modeling based on deep learning-related approaches.