Vertical Wind Speed Extrapolation Using Statistical Approaches



INTRODUCTION
Global wind power deployment has experienced an enormous surge due to its technological maturity, ease of maintenance, minimal human interaction, and commercial acceptance. In 2022, the global cumulative wind power installed capacity reached a record high of 898.24 GW [1]. This significant growth was largely driven by policy deadlines in China and the United States, both pushing for the utilization of cleaner and sustainable energy sources. With more and more countries committing to reducing their carbon footprints, wind power will likely play an increasingly important role in meeting global energy demands in the years to come.
Wind speed (WS) and direction are the most fluctuating meteorological parameters. These parameters change with time, location, and height of measurement. Therefore, thorough knowledge of wind speed variability is important in identifying suitable sites for wind farm development. Usually, wind speed measurements are carried out at heights below the wind turbine hub height. So, vertical wind speed extrapolation to the wind turbine hub height is useful for estimating the wind energy production over the lifetime of a wind farm. By analyzing wind speed data over a long period of time, wind energy companies can better understand the wind resources at a given site and optimize the placement of turbines to maximize energy production. Figures 1 and 2 indicate that WS increases with height.
Numerous methods have been used for the extrapolation of WS to hub heights using measured values taken at more than one height. Newman and Klein [2] discovered that the power law does not perform effectively in unstable atmospheric conditions. To determine the local wind shear exponent (LWSE), WS measurements are needed at multiple heights. The LWSE is then utilized to extrapolate the WS to higher heights. Although the LWSE proves to be a reliable option for WS extrapolation, it is a highly site-dependent parameter. Consequently, the LWSE must be determined for almost every site under consideration, which is a cost- and time-intensive process. Ayodele et al. [3] utilized WS data from 20 and 60 meters to calculate the LWSE, resulting in a more precise wind power assessment. Tizpar et al. [4] employed WS observations at 10, 30, and 40 meters to predict the LWSE, ensuring a dependable evaluation of wind power potential at the hub height. In a study conducted in Northern Cyprus, Solyali et al. [5] determined the LWSE by using WS at 50, 80, and 90 meters to accurately estimate WS at the required hub height for precise wind power potential assessment.
Boro et al. [6] investigated the characteristics of the vertical wind profile in Burkina Faso using wind data at 10 m above ground level (AGL) and satellite data at 50 m height in the atmospheric boundary layer. The authors used the standard power and logarithmic laws to estimate WS data from 20 m to 50 m. Barantiev and Batchvarova [7] analyzed wind speed profile statistics from acoustic soundings at a Black Sea coastal site. The authors used seven years of remote sensing data covering heights from 30 m to 600 m with a vertical resolution of 10 m. Steinheuer and Friederichs [8] used multivariate extreme value theory to derive a conditional distribution for hourly peak WS as a function of height. For training the system, the authors used peak WS observations at 5 vertical levels between 10 m and 250 m from the Hamburg weather mast. Sucevic and Djurisic [9] analyzed the influence of atmospheric stability on the wind speed profile and showed that the proposed method outperformed the classic logarithmic law. Studies on WS extrapolation with other methods can be found in [10][11][12][13][14][15].
Existing studies on WS extrapolation have shown a significant reliance on extensive site-dependent parameters, highlighting a critical gap in the field. Traditional models, especially in unstable atmospheric conditions, lack the versatility needed for diverse environmental scenarios. The reliance on the LWSE, while providing reliable results, necessitates detailed, site-specific measurements, leading to processes that are both time-consuming and costly. This issue is further compounded by the limited range of heights at which WS data is collected in many studies, restricting the ability to efficiently interpolate and extrapolate WS across broader vertical spans. Additionally, the focus of many studies on specific methodological approaches, such as multivariate extreme value theory or atmospheric stability analysis, underscores the need for a more comprehensive model capable of integrating a wider array of factors affecting WS. This gap calls for a more adaptable and less site-dependent approach to WS extrapolation, and it can be addressed using statistical methods that exploit the information provided by the available WS data measured at lower heights. These methods model the statistical properties of the data, which can then be used to extrapolate WS to higher heights.
This paper investigates the performance of seven different statistical approaches for vertical WS extrapolation. The statistical approaches used include Generalized Linear Models (GLM), Linear Regression (LR), Support Vector Machines (SVM), Generalized Additive Models (GAM), Gaussian Process Regression (GPR), Regression Tree (RT), and Ensemble Regression (ER). To evaluate the accuracy of these methods, several performance metrics are used, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Normalized RMSE (NRMSE), Normalized MSE (NMSE), Mean Absolute Error (MAE), Mean Bias Error (MBE), Mean Absolute Percentage Error (MAPE), Mean Percentage Error (MPE), Symmetric Mean Absolute Percentage Error (SMAPE), and the coefficient of determination (R²). By examining the performance of these seven methods, it can be determined which approach provides the most accurate vertical WS extrapolation.
The main scientific merit and contribution to the practice of engineering of this paper is as follows. This study presents the use of diverse statistical models such as GLM, SVM, and GPR, indicating a robust and innovative approach to data analysis. This contributes to the scientific field by exploring and validating various methodologies for predictive modeling. In addition, the detailed assessment of performance metrics such as MSE, RMSE, and MAE demonstrates a comprehensive approach to evaluating the accuracy and reliability of the models. This contributes to scientific rigor and helps in establishing benchmarks for future studies. The application of these statistical methods in engineering can significantly improve predictive modeling, which is essential for planning, design, and operational decisions. For instance, in energy engineering, accurate WS extrapolation is vital for wind farm installation planning.
The remainder of this paper is organized as follows. The Methodology section presents the approaches and the mathematical equations used in this paper. The experimental data collection section provides a brief description of the vertical wind speed profiler. The results section presents a comprehensive analysis of the performance of each algorithm, including a comparison of their accuracy metrics. Finally, the conclusion section summarizes the findings and highlights the significance of this study for wind energy applications.

METHODOLOGY
This section presents all statistical approaches used to perform vertical WS extrapolation and metrics to measure accuracy.

Generalized Linear Model (GLM)
The GLMs [16] are a broad class of statistical models that can handle a wide range of response variables, including binary, count, and continuous data. This paper uses a GLM with a quadratic model, meaning that the GLM is fitted using a quadratic function of the predictor variables. Mathematically, the quadratic function has the form:

y = β_0 + β_1^T x_1 + β_2^T x_2,  with x_2 = x_1²,  (1)

where y denotes the response variable, β_0 is the bias, and β_1 and β_2 are the coefficients of the predictor variables x_1 and x_2, respectively. The predictor variable is squared to introduce a nonlinear relationship between the response and the predictor variables.
In this model, x_1 is defined as a vector representing WS measurements at p heights (h_1 … h_p). Specifically, x_1 = {x_h1, x_h2, …, x_hp}, where each element x_hi corresponds to the wind speed measured at a different height. The term x_2 in the model is not a separate set of measurements but rather the element-wise square of the vector x_1. In the model, β_1 and β_2 are vectors of coefficients that correspond to the predictor variables x_1 and x_2, respectively. This quadratic model is useful when the relationship between the response and predictor variables is nonlinear and can be approximated by a quadratic function. By fitting the quadratic model, the GLM can capture the curvature of the relationship between the response and predictor variables, which cannot be captured by a simple linear model. The coefficients β_0, β_1, and β_2 can be calculated as follows:

β = (X^T X)^(-1) X^T y,  (2)

where X is the data input matrix, constructed by stacking the predictor variable x_1 and its squared value x_1², and y is the response variable. The matrix X consists of input vectors from N samples:

X = [x_1(1) x_2(1); x_1(2) x_2(2); …; x_1(N) x_2(N)],  (3)

where x_j(n) denotes input vector j from sample n.
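As an illustration, the quadratic fit via the normal equations can be sketched in a few lines. This is a minimal sketch on synthetic single-height data; the coefficients and wind speed range are illustrative assumptions, not the paper's dataset.

```python
import numpy as np

# Fit y = b0 + b1*x + b2*x^2 by least squares via the normal equations,
# mirroring the quadratic GLM described above (synthetic, noiseless data).
rng = np.random.default_rng(0)
x = rng.uniform(2.0, 10.0, size=200)        # wind speed at one lower height (m/s)
y = 1.5 + 0.8 * x + 0.02 * x**2             # assumed quadratic shear relation

X = np.column_stack([np.ones_like(x), x, x**2])   # design matrix [1, x, x^2]
beta = np.linalg.solve(X.T @ X, X.T @ y)          # beta = (X^T X)^-1 X^T y
y_hat = X @ beta
```

With noiseless data the recovered coefficients match the generating ones exactly, which is a quick sanity check on the design-matrix construction.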

Linear Regression (LR)
The mathematical representation of a linear regression [17] model is given by the following equation:

y = β_0 + β_1 x_1 + β_2 x_2 + … + β_p x_p,  (4)

where y is the response variable and x_1, x_2, …, x_p are the predictor variables. β_0, β_1, β_2, …, β_p are the coefficients or parameters to be estimated. The LR estimates the values of the coefficients β_0, β_1, β_2, …, β_p that best fit the data by minimizing the sum of squared residuals. The variables x_1 through x_p represent wind speed measurements at sequential heights, starting from the first measurement height and going up to the p-th height. The value of p, which denotes the number of measurement heights included in the model, varies depending on the specific height at which the wind speed is being extrapolated. For example, consider a scenario where the goal is to extrapolate wind speed at a height of 50 meters. In this case, the model would use wind speed measurements from lower heights as inputs to predict the wind speed at the 50-meter level. If the available measurements are at heights of 10, 20, 30, and 40 meters, then these measurements correspond to x_1, x_2, x_3, and x_4, respectively, and therefore p = 4. Each x_i in this sequence represents the wind speed recorded at the i-th height. Fitting employs the ordinary least squares (OLS) technique, which finds the coefficient values that minimize the sum of the squared differences between the observed response variable and the model-predicted values. Similar to the GLM, standard LR coefficients can be calculated using equation 2, where the squared values are excluded from the data input matrix.
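The 50-meter example above can be sketched directly as an OLS fit. The four predictor columns stand for the 10/20/30/40 m measurements; the values and coefficients are synthetic assumptions for illustration only.

```python
import numpy as np

# OLS with wind speeds at four lower heights predicting the 50 m level.
rng = np.random.default_rng(1)
X_low = rng.uniform(3.0, 9.0, size=(300, 4))          # columns: 10, 20, 30, 40 m
true_beta = np.array([0.2, 0.1, 0.15, 0.25, 0.55])    # [b0, b1..b4], assumed
y50 = true_beta[0] + X_low @ true_beta[1:]

A = np.column_stack([np.ones(len(X_low)), X_low])     # prepend bias column
beta_hat, *_ = np.linalg.lstsq(A, y50, rcond=None)    # OLS solution
```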

Generalized additive model (GAM)
The next statistical method is the generalized additive model (GAM) [18], which is specified as:

g(E[Y]) = α + f_1(X_1) + f_2(X_2) + … + f_p(X_p),  (5)

where Y is the response variable, X_1, X_2, …, X_p are the predictor variables, α is the bias, g is the link function that maps the linear predictor to the response space, and f_1, f_2, …, f_p are the smooth functions that capture the nonlinear relationship between the predictors and the response. The smooth function used in this paper is the logistic function:

f(x) = 1 / (1 + e^(-x)).  (6)
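A minimal sketch of this additive structure follows. For simplicity it uses an identity link, fixed logistic smooths, and a least-squares fit of the additive weights; this is a simplification of a fully backfitted GAM, and the data are synthetic.

```python
import numpy as np

def logistic(z):
    # The logistic smooth used above: f(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-z))

# Additive model: y = alpha + c1*f(x1) + c2*f(x2) + c3*f(x3), identity link.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
y = 0.5 + 2.0 * logistic(X[:, 0]) - 1.0 * logistic(X[:, 1]) + 0.3 * logistic(X[:, 2])

F = np.column_stack([np.ones(len(X)), logistic(X)])   # [1, f(x1), f(x2), f(x3)]
coef, *_ = np.linalg.lstsq(F, y, rcond=None)
```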

Support vector machine (SVM)
A support vector machine (SVM) [19] regression model predicts the response variable based on the values of one or more predictor variables. In classification, the SVM algorithm finds a hyperplane that best separates the data into two classes. In regression, the SVM algorithm instead finds a hyperplane that best fits the data by minimizing the error between the predicted and actual responses. Mathematically, the SVM regression model can be represented as:

y = w^T x + b + e,  (7)

where y, x, w, b, and e denote the predicted response, the vector of predictor variables, the weight vector, the bias term, and the error term, respectively. Basically, SVM finds the best linear fit while maximizing the margin between data points and the decision boundary. Therefore, SVM is less sensitive to outliers compared to LR. However, SVMs can be computationally more intensive than LR, which can be a drawback for very large datasets or real-time processing.
On the other hand, LR seeks to find a linear relationship between predictor variables and a continuous response variable. It does this by minimizing the sum of the squares of the differences between observed and predicted values. LR is straightforward to implement and interpret. It is well suited for situations where the relationship between variables is expected to be linear. LR models are computationally efficient, making them suitable for large datasets or real-time analysis. However, LR can be significantly influenced by outliers in the data.
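The SVM/LR comparison can be sketched side by side on the same synthetic data. This assumes scikit-learn is available (the paper's own scripts are in Matlab); the kernel, C, and epsilon settings are illustrative choices, not the paper's configuration.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression

# Fit an epsilon-insensitive linear SVR and an OLS line to the same data.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * x.ravel() + 1.0 + rng.normal(0, 0.05, size=200)

svr = SVR(kernel="linear", C=10.0, epsilon=0.1).fit(x, y)
lr = LinearRegression().fit(x, y)

x_test = np.array([[5.0]])   # both should predict close to 2*5 + 1 = 11
```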

Gaussian process regression (GPR)
A Gaussian process regression (GPR) [20] model uses a Gaussian kernel function with an isotropic (spherical) covariance function. Mathematically, the GPR model can be described by a set of input data X = [x_1, x_2, …, x_N] and corresponding output data Y = [y_1, y_2, …, y_N], where each x_n is a d-dimensional vector and each y_n is a scalar. The GPR model aims to find a function f(x) that maps the inputs to the outputs, with uncertainty quantified by a Gaussian distribution. Specifically, for a test input vector x_q, the GPR model estimates the output as:

ŷ(x_q) = μ(x_q) + ε(x_q),  (8)

where μ(x_q) is the mean function and ε(x_q) is a random variable representing the residual error at x_q. The mean function is defined as:

μ(x_q) = K(x_q, X) [K(X, X) + σ_n² I]^(-1) Y,  (9)

where K(x_q, X) is the covariance vector whose components are values of the kernel function k(x_1, x_2) quantifying the similarity between x_q and each input vector in X, K(X, X) is the covariance matrix over the training inputs, σ_n is the noise standard deviation, I is the identity matrix, and Y is the vector of output data. If there is no strong prior knowledge about the data, a zero-mean prior can be used. The kernel function is defined as:

k(x_1, x_2) = σ_f² exp(−‖x_1 − x_2‖² / (2ℓ²)),  (10)

where σ_f² is the signal variance, ℓ is the length-scale parameter, and ‖x_1 − x_2‖ is the Euclidean distance between the two input vectors.
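The predictive mean above can be implemented directly from the kernel and the linear solve. This is a minimal sketch on toy 1-D data with assumed hyperparameters (σ_f = 1, ℓ = 1, small noise), not the paper's fitted model.

```python
import numpy as np

def rbf_kernel(A, B, sigma_f=1.0, length=1.0):
    # k(x1, x2) = sigma_f^2 * exp(-||x1 - x2||^2 / (2 * length^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-d2 / (2 * length**2))

# Zero-mean GPR predictive mean: mu(x_q) = K(x_q, X) [K(X, X) + s_n^2 I]^-1 Y
X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.sin(X).ravel()
sigma_n = 1e-3

K = rbf_kernel(X, X) + sigma_n**2 * np.eye(len(X))
x_q = np.array([[1.5]])
mu_q = rbf_kernel(x_q, X) @ np.linalg.solve(K, Y)
```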

Regression Tree (RT)
The regression tree (RT) [21] model is used for predicting a response variable based on a set of predictor variables. The RT builds a decision tree by recursively partitioning the predictor variables into smaller subspaces that are as homogeneous as possible with respect to the response variable. The RT algorithm works as follows:
1. The algorithm starts with the entire dataset and selects a predictor variable and a split point that optimize the splitting criterion. The splitting criterion is typically the reduction in the mean squared error (MSE) between the response variable and the predicted values for each subspace.
2. The algorithm then divides the dataset into two smaller subsets based on the selected split point: one subset where the predictor variable is less than or equal to the split point and another subset where the predictor variable is greater than the split point.
3. For each subset, the algorithm recursively repeats steps 1 and 2 until a stopping criterion is met.The stopping criterion can be based on the maximum depth of the tree, the minimum number of observations in a leaf node, or the minimum reduction in MSE achieved by the split.
4. At each leaf node of the tree, the response variable is predicted as the mean value of the response variable for all observations in the leaf node.
Once the decision tree model has been created using the training data, it can be used to predict the response variable for new observations by traversing the tree from the root node to the appropriate leaf node based on the values of the predictor variables.The predicted response variable for a new observation is then the mean value of the response variable for all observations in the leaf node.
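The four steps above can be sketched for a single split on toy 1-D data; a full tree simply applies the same search recursively. The dataset is illustrative only.

```python
import numpy as np

def best_split(x, y):
    # Step 1: scan candidate split points, pick the one minimizing total SSE.
    best_s, best_sse = None, np.inf
    for s in np.unique(x)[:-1]:
        left, right = y[x <= s], y[x > s]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if sse < best_sse:
            best_s, best_sse = s, sse
    return best_s

# Steps 2-4: partition at the split; each leaf predicts its mean response.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 5.2, 4.8, 20.0, 19.5, 20.5])

s = best_split(x, y)
predict = lambda q: y[x <= s].mean() if q <= s else y[x > s].mean()
```

The two clusters in the toy data force the split between 3 and 10, and each side is predicted by its leaf mean.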

Ensemble Regression (ER)
The ensemble regression (ER) [22] is a training method for ensembling RTs. The method combines the predictions of multiple regression trees to produce a more accurate and robust model. Given a training set of input-output pairs {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}, where x_n is a vector of input features and y_n is the corresponding output value, the ER algorithm proceeds as follows:
a. Draw a bootstrap sample of the training data, i.e., randomly select N samples from the training set with replacement.
b. Grow a regression tree on the bootstrap sample. At each node of the tree, select the feature and split point that minimize the mean squared error of the predictions.
The ensemble's prediction for a new input vector x is the average of the predictions of the individual trees. In the numerical experiments, the number of trees in the ensemble is set to 100. The tree depth is unlimited, which means that each tree is grown until all leaves are pure or have fewer than 5 observations, with mean squared error as the splitting criterion. The ER algorithm uses bagging and random feature subsampling to improve the generalization performance of the model and prevent overfitting.
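The bootstrap-and-average loop can be sketched as follows, assuming scikit-learn's tree learner as the base model (the paper's own implementation is in Matlab); the data are synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Bagging: 100 trees, each grown on a bootstrap sample (selection with
# replacement), with leaves of at least 5 observations, as described above.
rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.1, size=300)

trees = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))   # (a) bootstrap sample
    trees.append(DecisionTreeRegressor(min_samples_leaf=5).fit(X[idx], y[idx]))

X_test = np.array([[2.0]])
y_ens = np.mean([t.predict(X_test)[0] for t in trees])   # average of tree outputs
```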

Models Summary
All models used in this paper are summarized in Table 1.

Error Metrics
This paper uses several metrics that are commonly used to evaluate the performance of each method.
1. Mean Squared Error (MSE) [23] is a measure of the average squared difference between the predicted and actual values:

MSE = (1/N) Σ_{n=1}^{N} (y_n − ŷ_n)²,  (11)

where N, y_n, and ŷ_n denote the number of observations, the actual value of the n-th observation, and the predicted value of the n-th observation, respectively.
2. Root Mean Squared Error (RMSE) [23] is the square root of the MSE and is also a measure of the average difference between the predicted and actual values:

RMSE = √MSE.  (12)

3. Normalized RMSE (NRMSE) [23] is the RMSE divided by the range of the actual values. It is a measure of the relative error of the predictions:

NRMSE = RMSE / (y_max − y_min),  (13)

where y_max and y_min are the maximum and minimum of the actual values, respectively.
4. Normalized MSE (NMSE) [25] is the MSE divided by the energy of the actual values. It is also a measure of the relative error of the predictions:

NMSE = Σ_{n=1}^{N} (y_n − ŷ_n)² / Σ_{n=1}^{N} y_n².  (14)

5. Mean Bias Error (MBE) [26] is the average difference between the predicted and actual values. It is a measure of the bias of the predictions:

MBE = (1/N) Σ_{n=1}^{N} (ŷ_n − y_n).  (15)

6. Mean Absolute Error (MAE) [23] is the average absolute difference between the actual and predicted values. It is a measure of the magnitude of the errors:

MAE = (1/N) Σ_{n=1}^{N} |y_n − ŷ_n|.  (16)

7. Mean Percentage Error (MPE) [27] is the average percentage difference between the predicted and actual values. It provides both the magnitude and direction of the errors:

MPE = (100/N) Σ_{n=1}^{N} (y_n − ŷ_n) / y_n.  (17)

8. Mean Absolute Percentage Error (MAPE) [23] is the average absolute percentage difference between the actual and predicted values:

MAPE = (100/N) Σ_{n=1}^{N} |y_n − ŷ_n| / y_n.  (18)

9. Symmetric Mean Absolute Percentage Error (SMAPE) [23] is a symmetric version of the MAPE, which uses the average of the predicted and actual values in the denominator, making it less sensitive to outliers:

SMAPE = (100/N) Σ_{n=1}^{N} |y_n − ŷ_n| / ((|y_n| + |ŷ_n|)/2).  (19)

10. Coefficient of determination (R²) [23] is a measure of the goodness of fit of the regression model. It represents the proportion of the variance in the dependent variable that is explained by the independent variables:

R² = 1 − Σ_{n=1}^{N} (y_n − ŷ_n)² / Σ_{n=1}^{N} (y_n − ȳ)²,  (20)

where ȳ is the average of the measured values. R² ranges between 0 and 1, with higher values indicating a better fit.
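Several of the metrics above translate directly into short functions; the sketch below writes out the main ones on a small illustrative example (ŷ denotes the predicted values).

```python
import numpy as np

# Error metrics, written out directly (y = actual, yh = predicted).
def mse(y, yh):   return np.mean((y - yh) ** 2)
def rmse(y, yh):  return np.sqrt(mse(y, yh))
def nrmse(y, yh): return rmse(y, yh) / (y.max() - y.min())
def mbe(y, yh):   return np.mean(yh - y)                  # negative = underestimate
def mae(y, yh):   return np.mean(np.abs(y - yh))
def mape(y, yh):  return 100 * np.mean(np.abs((y - yh) / y))
def smape(y, yh): return 100 * np.mean(np.abs(y - yh) / ((np.abs(y) + np.abs(yh)) / 2))
def r2(y, yh):    return 1 - np.sum((y - yh) ** 2) / np.sum((y - y.mean()) ** 2)

y  = np.array([4.0, 5.0, 6.0, 8.0])   # illustrative actual wind speeds (m/s)
yh = np.array([4.2, 4.8, 6.0, 7.5])   # illustrative predictions
```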

EXPERIMENTAL MEASUREMENT RESULTS
WS measurements are often made up to 40 meters due to resource limitations, and accordingly, estimation techniques are needed to obtain wind speed data at higher heights. Figures 1 and 2 show that WS increases with height.
The dataset used in this research on WS estimation originates from Dhahran, Saudi Arabia, and consists of 34,660 samples collected from June 20, 2015, to February 29, 2016, with measurements taken every 10 minutes. The data includes WS measurements from heights ranging from 10 to 180 meters. It is divided into three subsets for the purpose of model training and evaluation: 70% (about 24,262 samples) for training, 10% (around 3,466 samples) for validation, and 20% (approximately 6,932 samples) for testing. The research aims to estimate wind speeds at higher altitudes using measurements from lower heights, using an iterative estimation process where LiDAR-measured wind speeds at lower heights (10-40 meters) are used as reference for training models, and the output is the estimated wind speeds at higher altitudes, up to 180 meters. Although the dataset is specific to Dhahran and reflects its local atmospheric conditions, the methodologies and models developed could potentially be adapted for or compared with other locations, subject to similar data collection criteria and environmental conditions.
The estimation process involves training the models using measured WSs at lower heights as input and at higher heights as desired output. Subsequently, the proposed statistical estimation methods are trained using both actual and estimated wind speed values at lower heights to estimate wind speeds at the next level. This process is iterated until wind speeds at 180 meters, utilizing actual data from 10-40 meters and estimated data from 50-170 meters, are obtained.
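The iterative scheme can be sketched as a loop in which each newly estimated level is appended to the inputs for the next 10 m step. This is a hedged sketch: synthetic power-law profiles (shear exponent 0.14, an assumed value) stand in for the LiDAR data, and scikit-learn's LinearRegression stands in for the paper's models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Measured 10-40 m levels seed the inputs; each fitted model adds one
# estimated level, which is fed forward until 180 m is reached.
rng = np.random.default_rng(5)
heights = np.arange(10, 190, 10)                             # 10, 20, ..., 180 m
u10 = rng.uniform(3.0, 8.0, size=500)
profile = u10[:, None] * (heights[None, :] / 10.0) ** 0.14   # assumed shear law

X = profile[:, :4].copy()                   # "measured" inputs: 10-40 m
for level in range(4, len(heights)):        # estimate 50 m ... 180 m in turn
    model = LinearRegression().fit(X, profile[:, level])
    X = np.column_stack([X, model.predict(X)])   # estimate becomes a new input

u180_est = X[:, -1]
```

Because the synthetic profiles are exactly proportional across heights, the chained linear fits recover the 180 m level, which verifies the bookkeeping of the loop rather than any claim about real data.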
As shown in Table 2, the GLM model performs well, with low values for the MSE, RMSE, NRMSE, and NMSE. However, as the height increases, the errors tend to increase as well, as indicated by the increasing values of these metrics. All of the above error values keep increasing slightly as the height increases. The MBE values show slight under-prediction of WSs at most heights, while a positive bias is observed at the 160 to 180 meter levels. The MAE, MPE, MAPE, and SMAPE all increase as the height increases, indicating larger errors for higher heights due to their distance from the measured WS values at the 10 to 40 meter heights. The results in Table 3 suggest that the accuracy of the LR method of wind speed estimation decreases with height more rapidly than in the case of the GLM method. The lowest RMSE and MAE values are observed at 50 m, with the error increasing as the height increases. The MBE values are close to zero at the lower heights but become increasingly negatively biased at higher heights, indicating a systematic underestimation of wind speed. The MPE and MAPE values show that the percentage errors increase with height, with MAPE values ranging from 2.7% at 50 m to 12.9% at 180 m. The SMAPE values show a similar trend to the MAPE, indicating that the errors are symmetric across the different heights.
From Table 4, it can be noticed that the accuracy of the GAM model decreases as the height increases, with the highest errors seen at 180 m. The model tends to underestimate the values, as indicated by the negative MBE values. The MAPE and SMAPE values at 180 m are 13.74% and 14.03%, respectively.
The performance of the SVM technique is summarized in Table 5. In this case, the MBE values show that the model tends to slightly overestimate the wind speed at lower heights (50 m, 70 m) and underestimate it at higher heights (110 m and above). The MAE also increases with height, indicating that the model is less accurate at higher heights. The MPE shows that the model tends to underestimate the wind speed at higher heights, with a maximum underestimation of around 17.5% at 180 m. The MAPE and SMAPE show similar trends, with larger errors at higher heights. Overall, this analysis shows that the SVM model is not very accurate at predicting WSs at higher heights, and its performance decreases as the height increases. Therefore, it may not be suitable for extrapolating wind speeds to heights beyond 100-110 m.
In Table 6, the GPR performance for vertical wind speed extrapolation appears to be reasonable. The RMSE values range from 0.344 m/s to 1.753 m/s, which indicates that the model's predictions are generally within 1-2 m/s of the actual wind speed values. The NRMSE values are also relatively low, ranging from 0.039 to 0.132, which suggests that the model's predictions are relatively accurate and reliable across different heights. However, it is worth noting that the MPE and MAPE values are relatively high, ranging from 1.045% to 13.922%. This suggests that the model's predictions may be biased towards over- or under-estimating the wind speed at certain heights. Additionally, the MBE values are negative for most of the heights, which indicates that the model tends to underestimate the wind speed. In summary, while the GPR model appears to perform reasonably well, further investigation is needed to understand why the model may be biased in certain ways and whether there are ways to improve its accuracy and reliability.
The values of the performance measures in Table 7 provide insights into the accuracy of the RT approach. For instance, the MAE is less than 2 m/s at all heights. However, the MAPE and SMAPE are greater than 10% at most heights, which suggests that the percentage error can be quite high for some heights. While the regression tree has some accuracy in predicting wind speeds, the relatively high values of MAPE and SMAPE suggest that the model may not be suitable for all applications, especially those where high accuracy is critical.
The results of the ER method, shown in Table 8, indicate that the model's performance decreases as the height increases, with higher values observed for all error metrics at the highest height of 180 meters. However, the errors remain relatively small, with an RMSE of less than 2 meters per second at all heights. The model's MAE ranges from 0.157 m/s to 1.442 m/s, indicating that the predicted values are generally within 1-2 m/s of the actual values. The MAPE ranges from 2.606% to 13.795%, indicating that the model's predictions are within 14% of the actual values on average. The results indicate that the ensemble regression model appears to be a viable method for extrapolating vertical wind speed at various heights. In some contexts, this error level can be considered high for power generation purposes. Relative differences in the range of 10-20% might be acceptable in some meteorological or navigational applications. However, for wind energy generation, where accurate power predictions are necessary, these levels of relative error could be problematic. Due to the cubic relationship between WS and power output, even a 10% error (equivalent to MAPE) in WS estimation can lead to much larger errors in power estimation. If power estimation errors are to be limited to 30%, i.e., a power accuracy a_p ≥ 70%, then the WS estimation error threshold can be modeled as follows.
a_p = a_WS³,  (21)

where a_WS is the WS accuracy. To achieve a_p ≥ 70%, a_WS is governed by the following:

a_WS ≥ (0.70)^(1/3),  (22)

a_WS ≥ 0.8879,  (23)

MAPE < 0.1121.  (24)

Therefore, to achieve less than 30% power estimation error, the MAPE must be less than 11.21%.
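The cubic-accuracy bound above reduces to a one-line computation, which can be checked numerically:

```python
# Power scales with the cube of wind speed, so WS accuracy a_ws maps to power
# accuracy a_p = a_ws**3. Requiring a_p >= 0.70 yields the MAPE bound quoted.
a_ws_min = 0.70 ** (1.0 / 3.0)     # minimum WS accuracy, about 0.8879
mape_bound = 1.0 - a_ws_min        # about 0.1121, i.e. MAPE < 11.21 %
```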
The correlation coefficient between the measured and estimated WS values is compared through scatter diagrams, and the resulting R² values are summarized in Table 9 for all methods (GLM, LR, GAM, SVM, GPR, ER, and RT) used to extrapolate WS at different heights. It can be noticed that all the models have high R² values up to a height of 70 meters, indicating good estimates of WS. With further height increase, the R² value decreases progressively. Among the models, at lower heights (50, 60, and 70 m), SVM has the highest R² values, followed by the GLM and ER methods. However, for heights of 80 to 180 m, GLM achieved the highest R² values. On the other hand, GPR has the lowest R² values for most of the heights, indicating that it is the least accurate model for estimating the WS. However, even GPR has an R² value above 50% for all the heights, which means that it is still able to capture some of the variation in the data.
Figure 3 shows the scatter plots of all methods at heights of 50 m, 120 m, and 180 m, representing low, medium, and maximum heights, respectively. Upon examination of the scatter plots, it becomes apparent that as the measurement height increases, the accuracy of the predictions tends to deteriorate, as evidenced by the wider spread between the measured and estimated values and lower R² values. Despite this general trend, there are still some notable differences between the methods. At lower heights, GLM and SVM perform in a similar fashion, with both models producing relatively accurate estimates. However, at higher heights, GLM outperforms SVM, as the latter tends to overestimate the WS in these scenarios. This trend is confirmed in Figure 4, where estimated WS using the GLM, SVM, and GPR methods are compared with the actual WS at different heights (50 m and 180 m). It can be noticed that, despite producing good estimates at 50 m, SVM performs poorly at 180 m.

CONCLUSION
This study aimed to evaluate the performance of seven statistical approaches for vertical wind speed extrapolation: GLM, LR, SVM, GAM, GPR, RT, and ER. The results showed that the GLM, ER, and RT models provide reasonable accuracy in estimating WSs across different heights. However, the accuracy of other models, such as SVM and GAM, tended to decrease as the height increased. Additionally, some models showed a bias towards over- or under-estimating wind speeds at certain heights, as indicated by the MPE and MBE values. GLM arguably performed best at higher heights, as it is the only method that achieved an R² value of more than 60% at 180 m. To enhance the performance of the tested models, the study suggests exploring these statistical models in combination with global optimization methods such as Particle Swarm Optimization (PSO) and Genetic Algorithms (GA). Future steps in this research domain could include applying the models to more diverse datasets, enhancing model robustness, and developing capabilities for real-time predictions, expanding the applicability and reliability of the models in various practical scenarios.

Computer program
All scripts are written using Matlab and can be accessed here https://github.com/hilalnuha/WSstatistics.

Figure 4. Extrapolated WS compared with the actual WS at (a) 50 m and (b) 180 m