Correlation analysis of factors affecting wind power based on machine learning and Shapley value

An analysis of the impact of various factors on wind power can help grid dispatchers understand the characteristics of wind power output and improve the accuracy of wind power forecasting. A correlation analysis method of factors affecting wind power is proposed based on machine learning and the Shapley value. First, factors affecting wind power and the method of constructing wind power models based on machine learning are introduced. Then, to measure the influence of factors on wind power, the Shapley value is proposed based on the wind power model. In addition, calculation methods, properties, and application scenarios of the Shapley value are introduced. Finally, based on the actual data of a wind farm, the method is used to analyse environmental factors affecting wind power, and the main factors affecting the wind farm are determined. The experimental results show that the method can identify important factors affecting wind power and measure the complex non-linear relation between each environmental factor and wind power.


| INTRODUCTION
With the continuous expansion of wind power capacity, the intermittence and randomness of wind power caused by the characteristics of wind resources have an increasingly obvious impact on the power grid. For example, they pose many operational and control challenges that hamper the reliable and stable operation of power grids [1]. Wind power generation often faces difficulties regarding reliability in terms of the generation, planning, and scheduling of the supply of electricity [2,3]. In addition, the uncertainty of power load further increases the difficulty of power grid dispatch and operation [4]. There are two measures to mitigate the impact of wind power and improve the system's ability to absorb it: first, increasing the reserve capacity of the power system to improve its flexibility [5]; and second, improving the accuracy of wind power forecasting. As for the former, maintaining an ample and flexible reserve capacity is bound to be detrimental to the economic operation of power grids [6]. As for the latter, wind power forecasting can help power grid dispatchers to grasp the characteristics of wind power and provides a basis for adjusting the power grid dispatching schedule [7,8], improving the economic operation of the power grid. Wind power forecasting methods are mainly divided into two categories [9][10][11]: one directly predicts future wind power based on historical wind power; the other first predicts meteorological factors such as wind speed and direction, and then makes predictions based on wind power models.
However, wind power is affected by many complex factors. Quantitatively analysing the impact of these factors on wind power output, so as to improve the accuracy of wind power forecasting and the understanding of wind power operating characteristics, has become a main difficulty faced by power grid dispatching operations. Existing approaches include methods based on physical models and methods based on mathematical statistics. The physical model method establishes a physical model of wind power output based on the aerodynamic principles of wind turbines, and then analyses the influence of external factors on wind power based on the physical model [12][13][14]. However, because of the complex relation between various factors and wind power output, it is difficult to establish an accurate physical model [15]. Statistical methods use actual historical wind power data and apply correlation analysis and similar techniques to analyse the impact of various factors on wind power [16][17][18]. However, there is a highly complex non-linear relation between wind power and various factors, and there are also interactive correlations between various factors [19]. It is difficult to analyse the non-linear effects of factors on wind power quantitatively using statistical analysis methods. In addition, the correlation between factors and wind power has strong temporal characteristics, and such methods cannot measure the influence of changes in factors on the trend of wind power.
This work combines machine learning technology with the Shapley value from cooperative game theory to analyse the complex non-linear effects of environmental factors on wind power. Machine learning technology has powerful feature extraction and non-linear modelling capabilities and has achieved good results in the fields of wind power prediction and wind power modelling [20]. However, a wind power model trained with machine learning technology has a black-box characteristic and cannot explain the non-linear relation between factors and wind power. This research uses the Shapley value to explain the implicit relation learned by the model, so as to analyse quantitatively the influence of environmental factors on wind power. The contributions of this work are as follows:
1) The importance of environmental factors' non-linear impact on wind power can be measured based on the Shapley value, to identify important factors affecting power output.
2) The complex non-linear relation between environmental factors and wind power can be analysed based on the Shapley value. The interaction effect of two environmental factors on wind power can also be analysed using the Shapley interaction value of the factors.
3) The black-box characteristic of the wind power model based on machine learning is addressed using the Shapley value, to gain insight into the prediction process of the wind power model and to make the prediction results interpretable.
The purpose of this work is not to improve the performance of the wind power model algorithm, but based on the wind power model, to use the Shapley value to analyse the complex non-linear relation between the influencing factors and wind power. The remainder of this work is organised as follows. Section 2 analyses the main factors and characteristics that affect wind power output and the machine learning method of wind power model construction. Section 3 gives the definition, properties and calculation method of the Shapley value of environmental factors affecting wind power output. In Section 4, we use actual wind farm power data to validate the method. Conclusions are given in Section 5.

| Analysis of factors affecting wind power
The physical law of wind power generation states that [21]:
$$y = \frac{1}{2} C_p \rho \pi R^2 V^3 \tag{1}$$
where $y$ is the output power; $R$ is the radius of the rotor; $V$ is wind speed; $\rho$ is air density; and $C_p$ is the power coefficient, which is believed to be a function of (at least) the blade pitch angle and the turbine's tip speed ratio. What else might affect $C_p$ is still under debate. Currently, no formula exists to express $C_p$ analytically in terms of its influencing factors [22]. Therefore, $C_p$ is empirically estimated. Turbine manufacturers provide the nominal power curve for a specific turbine with the corresponding $C_p$ values under different combinations of wind speed, $V$, and air density, $\rho$. This expression also provides the rationale for why temperature, $T$, and air pressure, $P$, are converted into air density, $\rho$, to explain wind power, rather than used individually.
Although on the surface, the expression in Equation (1) suggests that the electrical power a wind turbine extracts from the wind is proportional to $V^3$, an actual power curve may exhibit a different non-linear relation. This happens because the tip speed ratio is a function of wind speed, $V$, making $C_p$ also a function of $V$ and adding complexity to the functional relation between wind speed and wind power.
The underlying physics of wind power generation expressed in Equation (1) provide some clues concerning a preferable power curve model. Thus [23]:
1) There appear to be at least three important environmental factors that affect wind power generation: wind speed, wind direction, and air density. This does not exclude the possibility that other environmental factors may also influence the power output.
2) The functional relations between the environmental factors and the power response are generally non-linear. The complexity comes, in part, from the lack of understanding of the power coefficient $C_p$, which is affected by many environmental factors. Because there is no analytical expression linking the power coefficient to any influencing factors, the functional form of the power curve is unknown.
3) The environmental factors appear in a multiplicative relationship in the power law equation.
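As a minimal sketch of the multiplicative power law discussed above, the following Python snippet evaluates the standard form $y = \frac{1}{2} C_p \rho \pi R^2 V^3$, assumed here to be what Equation (1) expresses. The conversion of temperature and pressure into air density uses the ideal gas law for dry air, which is an assumption of this sketch rather than a formula given in the text. The power coefficient of 0.4, the rotor radius of 40 m, and the function names are illustrative only.

```python
import math

R_DRY_AIR = 287.05  # specific gas constant of dry air, J/(kg*K) (assumption: dry air)


def air_density(pressure_pa, temperature_k):
    """Dry-air density from the ideal gas law: rho = P / (R_specific * T)."""
    return pressure_pa / (R_DRY_AIR * temperature_k)


def wind_power(c_p, rho, rotor_radius_m, wind_speed_ms):
    """Power law: y = 0.5 * C_p * rho * pi * R^2 * V^3, in watts."""
    return 0.5 * c_p * rho * math.pi * rotor_radius_m ** 2 * wind_speed_ms ** 3


# Illustrative values only: sea-level pressure, 15 degrees Celsius, C_p = 0.4, R = 40 m.
rho = air_density(101325.0, 288.15)
p_low = wind_power(0.4, rho, 40.0, 6.0)
p_high = wind_power(0.4, rho, 40.0, 12.0)
print(f"air density: {rho:.3f} kg/m^3")
print(f"power ratio when wind speed doubles at fixed C_p: {p_high / p_low:.1f}")
```

With $C_p$ held fixed, doubling the wind speed multiplies the power by eight. A real power curve deviates from this because $C_p$ itself changes with wind speed, as noted above.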

| Wind power model based on machine learning
A wind power model describes the corresponding relation between environmental factors and the output power of the wind turbine. The first principal use of the wind power model is for wind power forecasting. Wind power forecasting can be done by forecasting wind speed first and then converting a speed forecast to a power forecast through the use of a power model. The second principal use of the wind power model is for turbine performance assessment and turbine health monitoring, in which the wind power model is used to characterise a turbine's power production efficiency [24,25]. In both applications, accurate modelling of the power is essential, because it underlies subsequent decision making. However, there is a complicated non-linear relation between environmental factors and wind output power, and it is difficult to use mathematical analysis to establish an accurate wind power model. Therefore, the use of machine learning methods to build wind power models based on actual data has become an option. From the perspective of machine learning, wind power modelling is a typical supervised learning problem. It is the process of building a wind power model based on historical wind power data and historical environmental factor data. The process of model establishment is the process of training with historical data to minimise the expected value of loss, as in Equation (2):
$$\hat{F} = \arg\min_{F} \mathbb{E}_{Y,X}\, L\big(Y, F(X)\big) \tag{2}$$
where $\hat{F}(\cdot)$ is the wind power model; $L(\cdot)$ is the loss function; $Y = \{y_i\}_{i=1}^{n}$ are output wind power; $X = \{x_i\}_{i=1}^{n}$ are data of factors affecting wind power; $x_i = \{x_i^1, x_i^2, \ldots, x_i^p\}$ is the $i$th instance in the data; $x_i^j$ is the $j$th environmental factor in the $i$th sample; $p$ is the number of environmental factors affecting wind power; and $n$ is the number of training data instances.
Many supervised machine learning algorithms can be used to build wind power models, including artificial neural networks (ANNs) [26,27], support vector machines (SVMs) [28], and gradient boosting decision trees (GBDTs) [29]. Although these machine learning algorithms have achieved good results, they have a black-box characteristic and cannot explain the complex non-linear relation between environmental factors and wind power. If we can quantitatively analyse the complex relation between factors and wind power captured by the wind power model, we can quantitatively analyse the impact of environmental factors on wind power output.
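To make the supervised learning view concrete, the following is a deliberately small sketch of gradient boosting with squared loss, using depth-one trees (stumps) and only the Python standard library. It is not the implementation used in this work. The synthetic data, the number of rounds, the learning rate, and all function names are assumptions made for illustration.

```python
def fit_stump(X, residuals):
    """Find the single split (factor j, threshold t) that minimises squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, left_mean, right_mean)
    return None if best is None else best[1:]


def fit_gbdt(X, y, n_rounds=100, learning_rate=0.1):
    """Boosting for squared loss: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    stumps = []
    pred = [base] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:  # no split is possible
            break
        stumps.append(stump)
        j, t, left_mean, right_mean = stump
        pred = [p + learning_rate * (left_mean if row[j] <= t else right_mean)
                for p, row in zip(pred, X)]

    def model(row):
        out = base
        for j, t, left_mean, right_mean in stumps:
            out += learning_rate * (left_mean if row[j] <= t else right_mean)
        return out

    return model


def mse(model, X, y):
    """Mean squared loss of a model on (X, y)."""
    return sum((yi - model(row)) ** 2 for row, yi in zip(X, y)) / len(y)


# Synthetic data, illustrative only: columns are wind speed (m/s) and air density (kg/m^3).
X = [[v, rho] for v in (3.0, 5.0, 7.0, 9.0, 11.0) for rho in (1.15, 1.20, 1.25)]
y = [rho * v ** 3 / 2000.0 for v, rho in X]  # synthetic normalised power

model = fit_gbdt(X, y)
mean_y = sum(y) / len(y)
baseline = lambda row: mean_y  # predicts the average power for every instance
print(mse(model, X, y), mse(baseline, X, y))
```

The returned `model` is a plain function from a factor vector to a predicted power, which is all that a model-agnostic Shapley value calculation needs.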

| Shapley value of environmental factors affecting wind
The Shapley value is a method from cooperative game theory for distributing a payout among players based on their contribution to the total payout [30]. Wind power is affected by many environmental factors, and the contribution of each factor value to wind power output can be measured by the Shapley value. Note the two uses of the word 'value': the factor value is the numerical or categorical value of an environmental factor, whereas the Shapley value is the contribution of that environmental factor to the wind power. After the wind power model has been trained, the wind power output of the model can be explained by assuming that each environmental factor is a 'player' in a game where the output is the payout. The Shapley values tell us how to distribute the 'payout' fairly among the environmental factors of wind power, and thus measure the contribution of each factor to the wind power output.
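The Shapley value of an environmental factor value is its contribution to the wind power output based on the wind power model, weighted and summed over all possible environmental factor combinations. For an instance of historical data of factors affecting wind power, $x_i = \{x_i^1, x_i^2, \ldots, x_i^p\}$, the Shapley value of the $j$th environmental factor value is defined as in Equation (3) [30,31]:
$$\phi_j^{(i)} = \sum_{S \subseteq \{x_i^1, \ldots, x_i^p\} \setminus \{x_i^j\}} \frac{|S|!\,(p - |S| - 1)!}{p!} \Big(\mathrm{val}\big(S \cup \{x_i^j\}\big) - \mathrm{val}(S)\Big) \tag{3}$$
where $S$ is a subset of the environmental factor values of the instance, and $\mathrm{val}(S)$ is the prediction for the factor values in $S$, marginalised over the factors that are not included in $S$, as in Equation (4):
$$\mathrm{val}(S) = \int \hat{F}(x_i^1, \ldots, x_i^p)\, d\mathbb{P}_{x \notin S} - E_X\big(\hat{F}(X)\big) \tag{4}$$
where $\hat{F}(\cdot)$ is the wind power model; $p$ is the number of environmental factors; $E_X(\hat{F}(X))$ is the average of the predicted values over the $n$ data instances; and $x \notin S$, that is, $\{x_i^1, x_i^2, \ldots, x_i^j, \ldots, x_i^p\} \setminus S$, indicates factors that are not included in $S$. We actually perform multiple integrations, one for each environmental factor that is not contained in $S$. A concrete example follows: the wind power model works with four environmental factors $x_1, x_2, x_3, x_4$, and we evaluate the prediction for coalition $S$ consisting of environmental factor values $x_1$ and $x_3$ according to Equation (5):
$$\mathrm{val}(S) = \mathrm{val}(\{x_1, x_3\}) = \int_{\mathbb{R}} \int_{\mathbb{R}} \hat{F}(x_1, X_2, x_3, X_4)\, d\mathbb{P}_{X_2 X_4} - E_X\big(\hat{F}(X)\big) \tag{5}$$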
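The coalition value used in the four-factor example can be approximated empirically. The sketch below is one possible reading, not the procedure of this work: it fixes the factors in the coalition to the values of the instance of interest, draws the remaining factors from rows of a background data set in place of the integral, and subtracts the average prediction. The toy model, the data, and the names `coalition_value` and `background` are hypothetical.

```python
def coalition_value(model, x, coalition, background):
    """Empirical estimate of val(S) for instance x.

    Factors whose index is in `coalition` keep the value from x; all other
    factors are taken from each background row in turn (this replaces the
    integral over the factors not in S). The mean prediction over the
    background rows is then subtracted.
    """
    def mixed(z):
        return [x[k] if k in coalition else z[k] for k in range(len(x))]

    fixed = sum(model(mixed(z)) for z in background) / len(background)
    average = sum(model(z) for z in background) / len(background)
    return fixed - average


# Toy model with four factors (illustrative only, not a wind power model).
model = lambda r: r[0] * r[1] + 2.0 * r[2] - r[3]
background = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]
x = [2.0, 5.0, 4.0, 0.0]

# Coalition S = {x_1, x_3}, i.e. zero-based indices 0 and 2.
val_s = coalition_value(model, x, {0, 2}, background)
val_empty = coalition_value(model, x, set(), background)
val_full = coalition_value(model, x, {0, 1, 2, 3}, background)
print(val_s, val_empty, val_full)
```

The empty coalition has value zero by construction, and the full coalition has the value of the prediction for `x` minus the average prediction, which is the quantity the Shapley values distribute among the factors.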

| Properties of the Shapley value
The Shapley value is the only attribution method that satisfies the properties Efficiency, Symmetry, and Dummy, which together can be considered a definition of a fair payout [30].
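1) Efficiency: The contributions of factors in a data instance must add up to the difference between the predicted value for this instance and the average of the predicted values of all data instances:
$$\sum_{j=1}^{p} \phi_j^{(i)} = \hat{F}(x_i) - E_X\big(\hat{F}(X)\big) \tag{6}$$
where $\hat{F}(\cdot)$ is the wind power model; $p$ is the number of influencing factors; and $E_X(\hat{F}(X))$ is the average of the predicted values over the $n$ data instances.
2) Symmetry: The contributions of two factor values, $j$ and $k$, should be the same if they contribute equally to all possible coalitions. Given a subset $S$ of factors affecting wind power that contains neither $j$ nor $k$, if $\mathrm{val}(S \cup \{x_i^j\}) = \mathrm{val}(S \cup \{x_i^k\})$ for all such $S$, then $\phi_j^{(i)} = \phi_k^{(i)}$.
3) Dummy: A factor $j$ that does not change the predicted value, regardless of the coalition of factor values to which it is added, should have a Shapley value of 0. Given a subset $S$ of factors affecting wind power, if $\mathrm{val}(S \cup \{x_i^j\}) = \mathrm{val}(S)$ for all $S$, then $\phi_j^{(i)} = 0$.
The properties of the Shapley value ensure that the contribution of each environmental factor to the wind power is fairly distributed, thus reflecting the impact of each environmental factor on wind power.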
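The three properties can be checked numerically on a small example. The sketch below computes exact Shapley values by enumerating every coalition, using an empirical coalition value that averages over a background data set. The toy model, in which the first two factors play interchangeable roles and the third is ignored, and all names are assumptions chosen so that each property is visible.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, p):
    """Exact Shapley values for a coalition value function over p factors."""
    phi = []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        total = 0.0
        for size in range(p):
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {j}) - value(s))
        phi.append(total)
    return phi


# Toy set-up (illustrative only): the model ignores the third factor.
model = lambda r: r[0] * r[1]
background = [[0.0, 0.0, 1.0], [1.0, 1.0, 3.0]]
x = [2.0, 2.0, 5.0]
mean_prediction = sum(model(z) for z in background) / len(background)


def value(coalition):
    """Empirical val(S): factors in S fixed to x, the rest from background rows."""
    preds = [model([x[k] if k in coalition else z[k] for k in range(len(x))])
             for z in background]
    return sum(preds) / len(preds) - mean_prediction


phi = shapley_values(value, 3)
print(phi)
```

In this example the contributions sum to the prediction for `x` minus the mean prediction (Efficiency), the first two factors receive equal contributions (Symmetry), and the ignored third factor receives zero (Dummy). Exact enumeration evaluates on the order of $2^p$ coalitions per factor, so it is practical only for a small number of factors.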

| Method of Shapley value calculation
All possible coalitions (sets) of environmental factors have to be evaluated with and without the $j$th factor to calculate the exact Shapley value. For more than a few factors, the exact solution to this problem becomes problematic because the number of possible coalitions increases exponentially as more factors are added [31]. In addition, after the wind power model has been trained, the number of model input factors is fixed, so factors cannot truly be excluded from the data set to calculate the Shapley value. We therefore use an approximation with Monte-Carlo sampling to calculate the Shapley value [32]. The approximate Shapley estimate for a single factor value is [33,34]:
$$\hat{\phi}_j = \frac{1}{M} \sum_{m=1}^{M} \Big(\hat{F}\big(x_{+j}^{m}\big) - \hat{F}\big(x_{-j}^{m}\big)\Big) \tag{8}$$
The estimate is computed as follows. First, select an instance of interest $x$, a factor $j$, and the number of iterations $M$. For each iteration, a random instance $z$ is selected from the data and a random order of the factors is generated. Two new instances are created by combining values from the instance of interest $x$ and the sample $z$. The first instance, $x_{+j}$, is the instance of interest, but all values that come after factor $j$ in the order are replaced by the corresponding factor values from the sample $z$. The second instance, $x_{-j}$, is the same as $x_{+j}$, except that the value of factor $j$ is also replaced by the value of factor $j$ from the sample $z$. The difference in the prediction from the wind power model is computed according to Equation (9):
$$\phi_j^{m} = \hat{F}\big(x_{+j}^{m}\big) - \hat{F}\big(x_{-j}^{m}\big) \tag{9}$$
All of these differences are averaged to obtain the Shapley value of factor $j$ according to Equation (10):
$$\phi_j(x) = \frac{1}{M} \sum_{m=1}^{M} \phi_j^{m} \tag{10}$$
This procedure has to be repeated for each factor to get all Shapley values.
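A minimal sketch of the Monte-Carlo sampling estimate follows. It uses the common convention in which factor $j$ and the factors that precede it in the random order keep the values of the instance of interest, while the factors that follow it take the values of the random sample $z$; the second instance additionally takes factor $j$ from $z$. The function name, the toy linear model, the background data, and the fixed seed are assumptions for illustration.

```python
import random


def shapley_monte_carlo(model, x, j, background, n_iter=1000, seed=0):
    """Monte-Carlo estimate of the Shapley value of factor j for instance x."""
    rng = random.Random(seed)
    p = len(x)
    total = 0.0
    for _ in range(n_iter):
        z = rng.choice(background)      # random instance from the data
        order = list(range(p))
        rng.shuffle(order)              # random order of the factors
        pos = order.index(j)
        x_plus = list(z)                # start from z ...
        for k in order[:pos + 1]:       # ... keep x for factors up to and including j
            x_plus[k] = x[k]
        x_minus = list(x_plus)
        x_minus[j] = z[j]               # same, but factor j also comes from z
        total += model(x_plus) - model(x_minus)
    return total / n_iter


# Toy linear model (illustrative only). For a linear model the exact Shapley
# value of factor j is weight_j * (x_j - mean of factor j in the data).
model = lambda r: 2.0 * r[0] + r[1]
background = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
x = [4.0, 7.0]

phi_0 = shapley_monte_carlo(model, x, 0, background, n_iter=4000)
phi_1 = shapley_monte_carlo(model, x, 1, background, n_iter=4000)
print(phi_0, phi_1)
```

For this linear model the exact values are 5 for the first factor and 2 for the second, so the printed estimates can be compared against them. The sampling error shrinks as the number of iterations grows.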

| Measure the importance of impact of environmental factors on wind power based on Shapley value
Based on the physical understanding hinted at by the power generation law in Equation (1), it is apparent that wind speed, direction, and air density are important factors affecting wind power. The question is how to measure the importance of these factors. The Shapley value indicates the degree of influence of environmental factors on the wind power output. The greater the absolute value of the Shapley value, the greater the impact on wind power. The average absolute value of the Shapley value of each environmental factor over all data instances can measure the importance of the impact of that environmental factor on wind power, as in Equation (11):
$$I_j = \frac{1}{n} \sum_{i=1}^{n} \left|\phi_j^{(i)}\right| \tag{11}$$
where $I_j$ is the importance of environmental factor $j$ measured by the average absolute value of the Shapley value; $\phi_j^{(i)}$ is the Shapley value of the $j$th environmental factor of the $i$th instance; and $n$ is the number of data instances.
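The importance measure is a column-wise mean of absolute Shapley values. A short sketch follows, in which the Shapley matrix holds made-up numbers for two hypothetical factors and three instances; these are not results from the wind farm data.

```python
def factor_importance(shapley_matrix):
    """I_j = (1/n) * sum over instances of |phi_j^(i)|, for an n-by-p matrix."""
    n = len(shapley_matrix)
    p = len(shapley_matrix[0])
    return [sum(abs(row[j]) for row in shapley_matrix) / n for j in range(p)]


# Illustrative Shapley values: rows are data instances, columns are factors.
shapley_matrix = [
    [0.30, -0.02],
    [-0.10, 0.04],
    [0.20, -0.06],
]
importance = factor_importance(shapley_matrix)
ranking = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)
print(importance, ranking)
```

Taking absolute values before averaging matters: a factor whose contributions are large but alternate in sign would otherwise appear unimportant.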

| Shapley interaction value of two factors affecting wind power
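The interaction effect is the additional effect of two factors combined, after accounting for the individual effects of each factor on wind power. The Shapley interaction value from game theory is defined for two environmental factors $i$ and $j$ as [31,34]:
$$\phi_{i,j} = \sum_{S \subseteq \{x_1, \ldots, x_p\} \setminus \{x_i, x_j\}} \frac{|S|!\,(p - |S| - 2)!}{2\,(p - 1)!}\, \delta_{ij}(S) \tag{12}$$
when $i \neq j$ and:
$$\delta_{ij}(S) = \mathrm{val}\big(S \cup \{x_i, x_j\}\big) - \mathrm{val}\big(S \cup \{x_i\}\big) - \mathrm{val}\big(S \cup \{x_j\}\big) + \mathrm{val}(S) \tag{13}$$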
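The Shapley interaction value can be sketched by direct enumeration, as below. To keep the example short, the coalition value replaces missing factors with a single reference instance instead of averaging over the data, which is a simplification made here and not the procedure of this work. The toy model, in which the first two factors interact multiplicatively and the third acts additively, is hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_interaction(value, p, i, j):
    """Shapley interaction value of factors i and j (i != j) over p factors."""
    others = [k for k in range(p) if k not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        weight = factorial(size) * factorial(p - size - 2) / (2 * factorial(p - 1))
        for subset in combinations(others, size):
            s = frozenset(subset)
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


# Toy model (illustrative only): factors 0 and 1 interact, factor 2 is additive.
model = lambda r: r[0] * r[1] + r[2]
x = [2.0, 3.0, 1.0]
reference = [0.0, 0.0, 0.0]  # simplification: one reference instance, no averaging


def value(coalition):
    mixed = [x[k] if k in coalition else reference[k] for k in range(len(x))]
    return model(mixed) - model(reference)


phi_01 = shapley_interaction(value, 3, 0, 1)
phi_02 = shapley_interaction(value, 3, 0, 2)
print(phi_01, phi_02)
```

The additive factor shows no interaction with the first factor, while the multiplicative pair does. Because of the factor 2 in the denominator, the interaction is split equally between $\phi_{i,j}$ and $\phi_{j,i}$.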

| Process of the experiment
The method proposed here is used to analyse environmental factors affecting the power output of a wind farm. The detailed process of this experiment is shown in Figure 1:
1) Collect data, normalise wind power based on the rated power of the wind farm, and preliminarily analyse the influence of environmental factors on wind power during exploratory data analysis.
2) Divide the data into a training set, validation set, and test set. The training set is used to train the wind power model, the validation set is used to find the optimal hyperparameters of the model, and the test set is used to evaluate the performance of the model.
3) ANNs, SVMs, and GBDTs are used to train wind power models separately, and Bayesian optimization is used to find the optimal hyperparameters based on the validation set [35], so that the prediction performance of each algorithm is optimal. The mean squared loss is used as the loss function, as shown in Equation (14):
$$L\big(Y, \hat{F}(X)\big) = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \hat{F}(x_i)\big)^2 \tag{14}$$
where $L(\cdot)$ is the loss function; $\hat{F}(\cdot)$ is the wind power model; $Y = \{y_i\}_{i=1}^{n}$ are real wind power; $X = \{x_i\}_{i=1}^{n}$ are data of factors affecting wind power; and $n$ is the number of training instances.
4) Calculation of the Shapley value is based on the wind power model, and the performance of the model directly affects the accuracy of the Shapley value. Therefore, use the test set to evaluate the three models and select the model with the best performance based on a performance evaluation metric defined as the absolute error:
$$\mathrm{AE}_i = \big|y_i - \hat{F}(x_i)\big|$$
where $\hat{F}(\cdot)$ is the wind power model; $y_i$ is the real wind power of the $i$th instance; and $x_i$ are the environmental factor values of the $i$th instance.
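The data handling steps of this process can be sketched as follows. The 60/20/20 split, the fixed seed, the example rated power of 2000, and the function names are assumptions, since the text does not state the proportions or the tools used. The loss and error functions follow the mean squared loss and the absolute error described above.

```python
import random


def normalise_power(power, rated_power):
    """Express power output as a fraction of the rated power."""
    return [p / rated_power for p in power]


def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle and split rows into training, validation, and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_frac)
    n_val = round(len(rows) * val_frac)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def mean_squared_loss(model, X, y):
    """Mean squared loss, used for training and hyperparameter search."""
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(X, y)) / len(y)


def absolute_errors(model, X, y):
    """Per-instance absolute error, used to compare models on the test set."""
    return [abs(yi - model(xi)) for xi, yi in zip(X, y)]


# Illustrative use with made-up numbers.
train, val, test = split_dataset(range(10))
toy_model = lambda xi: 2.0 * xi[0]
X_test, y_test = [[1.0], [2.0]], [2.5, 3.0]
print(len(train), len(val), len(test))
print(mean_squared_loss(toy_model, X_test, y_test))
print(absolute_errors(toy_model, X_test, y_test))
print(normalise_power([500.0, 2000.0], 2000.0))
```

Keeping the test set out of both training and hyperparameter search is what makes the absolute errors a fair basis for choosing the model on which the Shapley values are then computed.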

| Exploratory data analysis of experimental data
The datasets used in this case study come from an inland wind farm used in Giwhyun et al. [23]. The turbine specifications in this wind farm are shown in Table 1. Based on the historical environmental factors and historical power data of this wind farm, the impact of its important environmental factors on wind power is analysed. It can be seen from Equation (1) that the output power of a wind turbine is directly related to the characteristics of the wind. In addition, output power is affected by environmental factors such as the wind's direction, air density, turbulence intensity, and wind shear. Thus, the selected environmental factors are shown in Table 2.
The correlation between the normalized wind power of the wind farm and its environmental factors is shown in Figures 2 and 3. Figure 2a-d are scatterplots between wind power and wind speed, air density, turbulence intensity, and wind shear, respectively. Figure 2 shows that between the cut-in wind speed and the rated wind speed, the wind power and the wind speed have a linear positive correlation. On the surface, there is no obvious correlation between the other three factors and wind power. The scatterplots in Figure 2 are unconditional on wind speed and wind direction. In this setting, these environmental factors show no obvious effect on the power output. Giwhyun et al. [23] analyse the same variables but under different wind speeds and wind directions; the result shows that interaction effects exist among wind speed, wind direction, and other environmental factors. Figure 3 shows that wind direction also affects wind power.

| Comparison of performance of different models
ANNs, SVMs, and GBDTs are used to train the wind power model based on the training set and validation set, and the test set is used to evaluate model performance. A box plot of the absolute error of each type of algorithm is shown in Figure 4. Figure 4 shows that the mean of the absolute error of the GBDT algorithm on the test set is relatively small and the

| Analysis of factors affecting wind power based on Shapley value
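Based on the wind power model $\hat{F}(\cdot)$ trained by GBDT and the entire data set, the Shapley value calculation method is used to calculate the Shapley values of the environmental factors, and the resulting importance of each factor is shown in Table 3.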
The result shows that wind speed is the main factor affecting wind power. Other environmental factors also have an impact on wind power. Their order of importance is turbulence intensity, air density, wind direction, and wind shear.

| Analyse the relation between environmental factors and wind power
The distribution of the Shapley value of each environmental factor of wind power is shown in Figure 5. Figure 5 shows that the higher the wind speed, the greater its impact on increasing wind power. This is consistent with the commonsense view that wind speed is the main factor affecting wind power. In addition to wind speed, wind direction, air density, turbulence intensity, and wind shear all affect wind power. The greater the turbulence intensity and the air density, the greater their impact on decreasing wind power.
Indications of the relation between the value of an environmental factor and the impact on the wind power are shown in Figure 5. However, the form of the relation is not shown. To measure the impact of the changes of environmental factors values on wind power clearly, wind speed and its corresponding Shapley value are combined with the values of wind direction, air density, turbulence intensity, and wind shear in the form of a scatterplot, as shown in Figure 6.
In Figure 6, the horizontal axis represents wind speed, the vertical axis represents the Shapley value of wind speed, and the colours represent the values of corresponding environmental factors. Figure 6 shows that sometimes when the wind speed exceeds the rated wind speed (around 13 m/s), the impact of wind speed on wind power will be reduced. What is the reason? Figure 6a,b,d shows that when the corresponding influencing factors change, the impact on wind power is small. However, Figure 6c shows that when the wind speed is above the rated wind speed, a reduction in the impact of wind speed on wind power is caused by an increase in turbulence intensity. Compared with Figure 6, it can be seen that the increase in turbulence intensity is the main reason for the reduction in the impact of wind speed on wind power when the wind speed is above the rated wind speed.

| Comparison with partial dependence method
To verify these results, based on the wind power model $\hat{F}(\cdot)$ trained by GBDT, partial dependence [34] is used to measure the influence of wind speed and turbulence intensity on wind power. The result is shown in Figure 8. Figure 8 shows that when the wind speed is above the rated wind speed, there is a collapsed area when the turbulence intensity is high. It also shows that the increase in turbulence intensity is the main reason for the reduction in the impact of wind speed on wind power when the wind speed is above the rated wind speed. However, this effect is less obvious in the partial dependence result than in the Shapley value analysis.
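For reference, partial dependence for a single factor can be sketched as below: the factor of interest is set to each grid value in turn for every data instance, and the predictions are averaged. This one-dimensional version is a simplification, since the comparison above varies wind speed and turbulence intensity jointly. The toy model, the data, and the names are hypothetical.

```python
def partial_dependence(model, data, factor, grid):
    """Average prediction over the data with `factor` forced to each grid value."""
    curve = []
    for g in grid:
        total = 0.0
        for row in data:
            modified = list(row)
            modified[factor] = g   # overwrite the factor of interest
            total += model(modified)
        curve.append(total / len(data))
    return curve


# Toy model (illustrative only): output falls as the second factor rises.
model = lambda r: r[0] * (1.0 - r[1])
data = [[5.0, 0.1], [7.0, 0.3]]

curve = partial_dependence(model, data, factor=0, grid=[10.0, 20.0])
print(curve)
```

Partial dependence reports one averaged curve for the whole data set, whereas the Shapley value gives a contribution for every individual instance. This may be one reason why the effect of turbulence intensity is easier to see in the Shapley value scatterplots.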

| Analysis of wind power forecasting process based on Shapley value
The Shapley value can be used to gain insight into how the wind power model predicts wind power, by observing the impact of each environmental factor on the forecast result during forecasting.
The Shapley values for the wind power prediction at a certain moment are shown in Figure 9.
In Figure 9, the upper and lower horizontal axes represent the output wind power of the wind power model, the middle vertical axis represents the average predicted wind power of all data instances, and the left vertical axis represents the environmental factors of wind power. The curve in Figure 9 represents the process of predicting wind power using the model. The value in brackets is the value of each environmental factor. The curve starts from the position of the average predicted wind power on the lower horizontal axis and is shifted by each environmental factor in turn until it reaches the predicted power on the upper horizontal axis. The magnitude of each shift is measured by the Shapley value, as in Table 4. The impact of the various environmental factors on wind power at this moment is as follows: wind shear has little effect on wind power, air density and wind direction have a negative impact on wind power, and turbulence intensity and wind speed have a positive impact on wind power. Negative and positive here refer to the difference between the predicted value for this moment (data instance) and the average of the predicted values of all data instances. According to the efficiency property of the Shapley value, the sum of the Shapley values of all environmental factors at this moment is equal to the difference between the predicted wind power at that moment and the average of the predicted powers of all data instances.
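The path from the average prediction to the prediction for one instance can be reproduced by cumulative addition of Shapley values, as sketched below. The numbers are invented for illustration and are not the values of Table 4. Only their signs follow the description above.

```python
def prediction_path(base_value, contributions):
    """Cumulative path from the average prediction to the instance prediction."""
    path = [base_value]
    for _name, phi in contributions:
        path.append(path[-1] + phi)
    return path


# Illustrative values only (normalised power); not taken from Table 4.
base_value = 0.35  # average predicted wind power over all data instances
contributions = [
    ("wind shear", 0.00),
    ("air density", -0.02),
    ("wind direction", -0.03),
    ("turbulence intensity", 0.05),
    ("wind speed", 0.40),
]
path = prediction_path(base_value, contributions)
for (name, phi), level in zip(contributions, path[1:]):
    print(f"{name:>20}: {phi:+.2f} -> {level:.2f}")
```

By the efficiency property, the last point of the path equals the model's prediction for the instance, whatever order the factors are added in.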

| CONCLUSIONS
This work combines machine learning technology with the Shapley value in cooperative game theory to analyse the correlation between environmental factors and wind power. This method can be used to measure the influence of environmental factors on wind power and to identify the main factors affecting wind power. It can also be used to analyse the complex non-linear relation between each environmental factor and wind power and to gain insight into the prediction process of the wind power model. Based on the historical data of a wind farm, this method is used to analyse environmental factors affecting wind power. The experimental results show that, in addition to wind speed, turbulence intensity, air density, wind shear, and wind direction all affect the output of this wind farm. Moreover, when the wind speed is greater than the rated wind speed, the reduction in the influence of wind speed on wind power is mostly caused by an increase in turbulence intensity. The core idea of this research is to use Shapley values to reveal the complex relations hidden in the wind power model trained by machine learning techniques. Therefore, the performance of the wind power model directly affects the accuracy of the calculated Shapley values. Before using this method, the wind power model should be trained to be as accurate as possible. In addition, the operating characteristics of different wind farms are different, and the analysis of the data of different wind farms using this method may lead to different conclusions.