Virtual active power sensor for eolic self-consumption installations based on wind-related variables

Green energy production is expanding in both individual and large-scale electricity grids, driven by the imperative to reduce greenhouse gas emissions. This research performs a comparative analysis of several linear and non-linear regression models to identify the most effective method for estimating the active power produced by a mini wind turbine from meteorological variables, with the aim of obtaining a reliable virtual sensor. The modeling process included a feature selection step before applying eight machine learning techniques, whose results were statistically analysed to determine the best-performing one.


Introduction
During the last decades of the 20th century and the first decades of the 21st century, the world experienced significant growth in population, industrial activity and land, maritime and air transport [15]. Consequently, the rise in energy needs represents a critical issue. In this context, the main sources of primary energy have been fossil fuels. For example, in 2020, 31% of primary energy came from oil, 27% from coal and 25% from natural gas [7].
This strong dependency on combustion technologies increased greenhouse gas emissions, causing climate change with potentially devastating and irreversible effects, such as temperature and precipitation changes or extreme climate events [15].
Furthermore, according to different estimations, the world population reached 8 billion in 2022 [22], and it is expected to peak in the mid-2080s. These severe circumstances describe a scenario in which immediate and strict measures must be taken. Hence, governments, corporations and organizations are expected to tackle this situation through several lines of action [25]. First, the electrification of demand can be a valuable solution to reduce greenhouse gas emissions. This includes actions such as promoting electric mobility to replace vehicles powered by conventional combustion engines, or using heat pumps for domestic hot water and air conditioning. However, the impact of this first line of action depends strongly on how the electric energy is generated: if it is obtained from fossil fuels, the positive effects of electrification are negligible [25].
Second, promoting installations that generate electricity from renewable sources plays a significant role in reducing or mitigating the consequences of climate change. The European Green Pact, signed in 2021 by EU countries, is an example of this trend; in it, they commit to a zero-emission scenario by 2050 [10]. More recently, the twenty-eighth session of the Conference of the Parties (COP 28) [1] established that, to keep warming below 1.5 °C, or at least 2 °C, the world needs to triple the current renewable energy capacity by 2030, among other measures.
These policies are already driving a rapid change in the generation structure of most countries. According to [4], the global annual renewable capacity installed in 2023 increased by almost 50%, the fastest growth recorded during the past two decades. This trend would imply that by 2025 renewable energies will surpass coal as the most important source of electricity generation, and that by 2028 they will represent 42% of the global energy share.
Within renewable energy generation, two technologies are the most promising: photovoltaic and eolic. Furthermore, both can be classified according to the generation structure into large-scale installations and self-consumption installations. While the former are designed to deliver high power and supply many customers in the medium range, the latter are low-powered and placed at the consumption point to supply the prosumer's demand. This second topology significantly reduces the distance between consumption and generation points, with the corresponding reduction in transport losses [24]. However, maintenance needs or undesired operating conditions can threaten the adoption of self-consumption installations, because they require knowledge and expertise in this specific field that most prosumers do not have. Tools that supervise the current state of a renewable energy facility and detect unexpected situations can therefore play a significant role. In this case, virtual sensors can contribute to robust performance. This kind of tool, whose general structure is presented in Figure 1, represents an important breakthrough in systems supervision. From indirect system variables, it estimates the value of a physical magnitude (virtual value), which can be used to validate the measurements obtained from a physical sensor, thus increasing the reliability of the supervision process. This can also help to detect unexpected events and malfunctions, or even support the maintenance decision-making process.
Several approaches can be followed to predict energy parameters [11]. First, a detailed physical description of the system can be built with mathematical methods, considering the area, obstacles, weather forecasting, pressure or temperature, among other parameters. However, a physical description is not always feasible, and statistical methods are presented as an alternative. These methods aim to identify linear and non-linear relationships between the atmospheric and electric variables of a dataset. This is carried out with both time series and regression techniques [11]. In [23], a hybrid model combining time series and regression models is proposed to determine wind speed using daily wind speed data. Another renewable technology is studied in [26], where photovoltaic power generation is predicted through different convolutional neural networks.
This paper deals with a real application: a self-consumption wind power installation placed in a bioclimatic house. Wind power is currently the energy technology with the largest installed capacity in Spain, with 25% of the share, according to [3]. To give an idea of the sustained trend in recent years, wind energy grew 31% in the period 2015-2023 [3]. This type of infrastructure exploits the air mass movement resulting from the uneven heating of the atmosphere by solar energy [14]. The power available in the wind is proportional to the cube of the wind speed. Therefore, the power generated strongly depends on weather and atmospheric conditions, with the corresponding problems regarding seasonality and fast variability [14].
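As a rough illustration of the cubic dependence on wind speed, the sketch below evaluates the textbook expression for the power extracted from an air stream, P = 0.5 · ρ · A · C_p · v³. The function name, air density, rotor area and power coefficient are illustrative assumptions and are not parameters of the installation studied here.

```python
def wind_power(speed_ms, rotor_area_m2=1.0, air_density=1.225, power_coeff=0.4):
    """Power (W) extracted from the wind, P = 0.5 * rho * A * Cp * v**3.

    All default values are assumed for illustration only.
    """
    return 0.5 * air_density * rotor_area_m2 * power_coeff * speed_ms ** 3


# Doubling the wind speed multiplies the power by 2**3 = 8,
# whatever the (assumed) constants are.
ratio = wind_power(8.0) / wind_power(4.0)
print(f"power ratio when the speed doubles: {ratio:.1f}")
```

This sensitivity is one reason why small variations in wind speed translate into large variations in the generated power.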
The efforts in this renewable technology focus especially on optimizing materials, structure, turbine design and energy management. As a result, different topologies exist depending on the application, subject to factors such as wind speed, power generated, etc. [17]. The direction of the rotation axis is an example of this design variability. Although horizontal-axis turbines are more common, vertical-axis ones have a lower power-size ratio and are more reliable at low wind speeds [17].
Efforts in wind technology are primarily directed toward optimizing wind generator facilities, including mechanical design and energy management. Turbines for power generation are designed to produce varying amounts of power, ranging from kilowatts to megawatts, depending on factors such as size, axis direction, number of blades and application. The present research compares the performance of eight different regression techniques to implement a virtual sensor that predicts the active power generated by a small wind turbine from wind-related variables. This contributes to detecting unexpected performance and enhances decision-making in maintenance tasks.
This work is structured as follows: after this introduction, the materials and methods used to conduct the experiments are presented in the next section. The experiments and results are included in the third section. Finally, conclusions and future works are presented in section 4.

Materials
The ecological transition context described in the previous section favors green energies that replace conventional fossil fuel technologies. One of the lines of action consists of promoting bioclimatic buildings that use different renewable sources to supply their energy needs, reducing the environmental impact. However, this kind of building is not yet widespread, so different foundations and government administrations make significant investments to promote them. That is the case of the Sotavento Foundation, located in the north of the Galician region, Spain [2].
Although this research focuses on a wind turbine, the needs of the building are supplied by different sources, as shown in Figure 2, where the yellow components are in charge of supplying electricity and the red ones cover the Domestic Hot Water (DHW) needs:
• Electricity needs:
- Wind turbine system. A low-power wind generator of 1.5 kW placed on an eight-meter tower.
- Photovoltaic (PV) system. 22 polycrystalline silicon modules with 2.7 kW.
- Power grid. In case there is not enough PV or wind energy, the power network supplies the electric needs.
• DHW needs:
- Solar thermal system. 8 panels cover a 20 m² area. The hot water is stored in a 1000 l storage tank.
- Geothermal system. The installation consists of a five-hundred-meter pipe connected to a heat pump of 8.2 kW.
- Biomass system. Pellets are burnt to achieve a variable power from 7 kW to 20 kW.
Focusing on the wind generator installation, the turbine is a BORNAY INCLIN 1.500 system [2], with two blades fixed to a shaft connected to a three-phase permanent magnet synchronous generator and a housing that ensures optimal orientation. The start and stop wind speeds are 3.5 m/s and 14 m/s, respectively, with the nominal power reached at 12 m/s. This turbine has a braking system, which can be activated manually or automatically when the wind speed is too high.
The dataset initially had 52560 samples registered with a 10-minute sampling period during one year. After removing instances with NaN values, the final dataset comprises 50834 samples. Each one contains 25 features with information about electrical, temporal and meteorological variables. However, as the main goal is to predict the energy generated from meteorological variables, electrical and temporal variables were discarded. Table 1 summarizes the variables taken into account in the experiment setup.
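The sample counts above can be cross-checked with a few lines of arithmetic. The sketch assumes a non-leap year of 365 days, which is consistent with the reported initial size; the variable names are illustrative.

```python
SAMPLES_PER_HOUR = 6  # one sample every 10 minutes

# Assumption: a non-leap year of 365 days.
expected_samples = 365 * 24 * SAMPLES_PER_HOUR
valid_samples = 50834  # samples kept after removing NaN instances

removed = expected_samples - valid_samples
removed_pct = 100 * removed / expected_samples
print(expected_samples, removed, round(removed_pct, 2))
```

Under this assumption, 1726 samples (about 3.3% of the year) were discarded because of missing values.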

Methods
Two main steps are followed to achieve a proper prediction: first, the relationship of each variable with the power generated is analysed, and then regression techniques are applied. A feature selection method is used to select the variables most directly associated with the power production, which is the value of interest. Then, eight regression methods are fitted to forecast the power generated, and a statistical analysis determines which one is significantly better.

Correlation matrix
The correlation matrix is a fundamental component in statistical analysis, playing a significant role in understanding the relationships between variables. It is a square and symmetric matrix that provides an intuitive overview of the pairwise correlations among the variables of a dataset [13]. This represents a powerful tool for examining the strength and direction of linear relationships between variables. Each matrix element is a number that ranges from -1 to 1. When two variables have a strong positive correlation, the number is near 1, while a perfect negative correlation is represented by -1. Values near zero mean no linear correlation [13].
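A minimal sketch of how such a matrix can be computed is given below, using the Pearson coefficient in plain Python. The column names and toy values are made up for illustration and do not come from the dataset used in this work.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Square, symmetric matrix of pairwise correlations, stored as nested dicts."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy, made-up data (not the real measurements).
data = {
    "speed": [2.0, 4.0, 6.0, 8.0],
    "power": [10.0, 80.0, 270.0, 640.0],
    "temperature": [15.0, 14.0, 16.0, 15.0],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

The diagonal is always 1 and the matrix is symmetric, so only the upper or lower triangle needs to be inspected.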

Regression methods
In this research, the performance of eight different regression methods has been compared to obtain the best predictor.A brief explanation of each technique applied is presented below.

Recursive Least Squares Regressor
The Recursive Least Squares (RLS) algorithm minimizes the sum of squared errors, where each error is the difference between the prediction made by the model and the actual value [9]. This method uses a linear function that expresses the value to be predicted, y, as a function of the features, X, and the parameters of the model, w. This relationship is expressed in Equation 1.
$$y(w, X) = w_0 + w_1 x_1 + \dots + w_n x_n \qquad (1)$$
Where n is the number of features. The hyperparameters directly affect how w is calculated. The coefficients can be forced to be only positive, and the model can be configured to include or not the independent term, w_0, which is assigned the value 0 if it is not calculated [9].
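To make the role of the independent term concrete, the following sketch fits the linear model above for a single feature using the closed-form least-squares solution. It is a simplified, batch (non-recursive) illustration; the function name and toy data are assumptions, and the positivity constraint is not covered.

```python
def fit_line(x, y, fit_intercept=True):
    """Least-squares fit of y ~ w0 + w1 * x for one feature.

    When fit_intercept is False, w0 is fixed to 0 and only w1 is estimated.
    """
    n = len(x)
    if fit_intercept:
        mean_x, mean_y = sum(x) / n, sum(y) / n
        sxx = sum((a - mean_x) ** 2 for a in x)
        sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        w1 = sxy / sxx
        w0 = mean_y - w1 * mean_x
    else:
        w1 = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
        w0 = 0.0
    return w0, w1


# Toy data generated from y = 1 + 2x (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
with_intercept = fit_line(xs, ys, fit_intercept=True)
without_intercept = fit_line(xs, ys, fit_intercept=False)
print(with_intercept, without_intercept)
```

Forcing w0 = 0 changes the slope that minimizes the squared error, which is why the two configurations are treated as different models.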

K-Nearest Neighbours Regressor
The K-Nearest Neighbours (KNN) is a non-linear algorithm [12]. The predicted value is calculated as the mean of the targets of the K nearest neighbors of the input sample, meaning there is no model training as such; the training data are simply stored. The model's behavior is adjusted by modifying the number of neighbors used to make the prediction, K, and how each neighbor is weighted.
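The prediction rule can be sketched in a few lines. This is an illustrative implementation with Euclidean distance and two weighting schemes (uniform and inverse distance); the function name and toy data are assumptions, not the code used in the experiments.

```python
def knn_predict(train_x, train_y, query, k=3, weights="uniform"):
    """Predict a target as the (weighted) mean of the k nearest training targets."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)) ** 0.5, y)
        for x, y in zip(train_x, train_y)
    )[:k]
    if weights == "uniform":
        return sum(y for _, y in nearest) / len(nearest)
    # Inverse-distance weighting; an exact match dominates the prediction.
    exact = [y for d, y in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)
    inv = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(inv, nearest)) / sum(inv)


# Toy one-feature training set (made-up values).
train_x = [[1.0], [2.0], [3.0], [10.0]]
train_y = [10.0, 20.0, 30.0, 100.0]
uniform_pred = knn_predict(train_x, train_y, [2.0], k=3, weights="uniform")
distance_pred = knn_predict(train_x, train_y, [2.5], k=2, weights="distance")
print(uniform_pred, distance_pred)
```

Because the prediction is an average of observed targets, it can never fall outside the range of the training targets.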

Decision Tree Regressor
The Decision Tree (DT) method is a non-parametric method [8] that implements a tree diagram. This model splits the data into smaller and smaller groups at every node as the diagram progresses. When a final node is reached, the model predicts a numeric value. This implementation is represented by if-then conditions, which result from rules learned from the explanatory variables. The model's behavior can be adjusted by selecting the criterion followed to set the best data split in each node and the maximum depth of the tree diagram. In [21], a decision tree regressor is employed to predict solar energy generation.
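The core operation at each node, choosing the split, can be illustrated as follows. The sketch searches a single feature for the threshold that minimizes the sum of squared errors of the two resulting groups; the function name and toy data are assumptions made for illustration.

```python
def best_split(x, y):
    """Return (threshold, sse) of the split minimising squared error on one feature."""

    def sse(values):
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    pairs = sorted(zip(x, y))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Toy data with an obvious jump between x = 3 and x = 10 (made-up values).
threshold, error = best_split([1, 2, 3, 10, 11, 12], [1, 1, 1, 5, 5, 5])
print(threshold, error)
```

A full tree applies this search recursively to each group until the maximum depth is reached, and predicts the mean target of the final group.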

Random Forest Regressor
The Random Forest (RF) method groups several decision trees to create the model [6]. Each decision tree is a regressor trained with a slightly different dataset, so that the resulting regressors differ from each other. The final prediction is the average of the results of the individual trees. The hyperparameters of this model include those of the decision trees (the split criterion and the maximum depth of the tree diagram) and others specific to RF models, such as the number of regressors used to build the forest.

Polynomial Regression
The polynomial regression (PN) is based on linear regression, extending its models by combining explanatory variables into n-degree polynomials [18]. As in linear regression, the settings include the possibility of forcing the coefficients to be positive and of calculating the independent term or not. The maximum degree of the polynomial is also a hyperparameter.

Bayesian Ridge Regression
The Bayesian Ridge (BR) regression, also derived from linear regression, is a method that generates a probabilistic model [16] including some regularization parameters. In this model, the output is assumed to follow a Gaussian distribution, and the prior over the coefficients is a spherical Gaussian. The hyperparameters α1 and α2 (for α) and λ1 and λ2 (for λ) affect the precision of these Gaussian distributions.

Support Vector Regression
The Support Vector Regression (SVR) method is a modification of the Support Vector Machine (SVM) method used for classification tasks. This modification generalizes the classification problem by adding an insensitive region around the decision function of the SVM, called the ε-tube, which allows the model to return a continuous output and thus solve regression problems. In addition to the width of the ε-tube, it is possible to set the kernel type used in the algorithm, which modifies the decision function and maps the input data into another dimensional space, and the regularization parameter.
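The effect of the ε-tube can be illustrated with the ε-insensitive loss commonly associated with SVR: errors smaller than ε are ignored, and larger errors are penalized only by the amount that exceeds ε. The function name and sample values below are illustrative assumptions.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample loss max(0, |y - y_hat| - epsilon)."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


# Made-up targets and predictions: the first error lies inside the tube.
losses = epsilon_insensitive_loss([10.0, 10.0, 10.0], [10.5, 12.0, 7.0], epsilon=1.0)
print(losses)
```

A wider tube tolerates larger deviations without penalty, which is why ε is tuned together with the regularization parameter.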

Multilayer Perceptron
The Multilayer Perceptron (MLP) is formed by the union of different neurons, which are distributed in layers and form a network [20]. In this network, the data flows from the input layer, where the features are received, to the output layer, where the result, which can be a regression or a classification, is obtained. All the intermediate layers are known as hidden layers. It is a feedforward artificial neural network. The number of neurons in each layer is selected individually: the input layer has as many neurons as features, the output layer has as many neurons as predicted values, and the hidden layers have an arbitrary number. Several functions can be selected as the activation of each neuron.
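The forward pass of such a network with one hidden layer can be sketched as follows. The example assumes a hyperbolic tangent activation in the hidden layer and a linear output neuron; the weights are arbitrary made-up numbers, not trained values.

```python
from math import tanh


def mlp_forward(features, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer with tanh activation followed by a linear output neuron."""
    hidden = [
        tanh(sum(w * x for w, x in zip(weights, features)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Three input features and two hidden neurons (arbitrary illustrative weights).
prediction = mlp_forward(
    features=[1.0, 0.5, 2.0],
    hidden_weights=[[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]],
    hidden_biases=[0.1, -0.1],
    output_weights=[1.5, -0.5],
    output_bias=0.2,
)
print(prediction)
```

Training consists of adjusting these weights and biases so that the output matches the measured values as closely as possible.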

Statistical analysis
A proper statistical analysis determines which regression technique is significantly better than the rest, in order to select the best model.

Kruskal-Wallis H-test
The Kruskal-Wallis test is a non-parametric test [19] that tests the null hypothesis that the medians of independent groups of data are equal. This procedure establishes whether there are significant differences between them. The Kruskal-Wallis H-test returns the p-value, the probability of obtaining differences at least as large as those observed if the null hypothesis were true. It is necessary to set a significance level, usually denoted α, which is used as a threshold to determine whether the result is statistically significant. If the p-value is less than or equal to α, the null hypothesis is rejected, meaning that the data groups differ.
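The H statistic behind this test can be computed directly from the ranks of the pooled data. The sketch below is a simplified version that uses average ranks for ties but omits the tie correction factor; the groups are made-up numbers, and the closed-form p-value shown is valid only for three groups (chi-square distribution with 2 degrees of freedom).

```python
from math import exp


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (average ranks for ties, no tie correction)."""
    pooled = sorted(v for g in groups for v in g)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n = len(pooled)
    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * rank_term - 3 * (n + 1)


# Three made-up groups of errors that clearly differ.
h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
p_value = exp(-h / 2)  # chi-square survival function, 2 degrees of freedom only
print(round(h, 3), round(p_value, 4), p_value <= 0.05)
```

For a different number of groups, the p-value must be taken from a chi-square distribution with (number of groups - 1) degrees of freedom.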

Tukey test
Tukey's Honestly Significant Difference test [5], better known as Tukey's HSD, performs pairwise comparisons, making it possible to reject or accept the null hypothesis for each pair of groups. This test also requires setting a significance level, α.
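The number of pairwise comparisons grows quickly with the number of groups. As a small illustration, the eight regressors compared in this work can be enumerated in pairs as follows (the list of labels is just the set of method abbreviations used in the text).

```python
from itertools import combinations

models = ["RLS", "KNN", "DT", "RF", "PN", "BR", "SVR", "MLP"]

# Every unordered pair of models is compared once.
pairs = list(combinations(models, 2))
print(len(pairs))  # 28
```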

Experiment setup and results
The setup of the experiments and their results are described in this section.

Experiments setup
This subsection provides the experiment setup, including the tools and metrics used to measure and compare the performance of the regression methods.

Feature selection
The feature selection is carried out by interpreting the correlation matrix: the features whose correlation with the target variable exceeds a threshold of 0.7 are considered strongly correlated and are selected.

Regression techniques
Different model configurations are tested, looking for the best configuration of each technique. The hyperparameters and values tested are listed below; all the possible combinations of these values were evaluated.

Recursive Least Squares
The model's configurations affect the use of the independent term, which can be calculated or set to 0, and the sign of the coefficients, which can be forced to be positive or not. This results in 4 different models.

K-Nearest Neighbors
Different numbers of neighbors used to make the prediction were tested: all even numbers from 4 to 40. Also, two functions were used to assign the weights of the neighbors: a uniform function, where all neighbors have the same weight, and a distance function, where the weight of a neighbor is inversely proportional to its distance to the query point. This results in 38 different models.

Decision Tree
The maximum depth of the tree diagram was varied from 1 to 10 in steps of 1. Also, two different split criteria were used, based on the absolute error and the squared error. This results in 20 different models.

Random Forest
The same hyperparameters and values tested in Decision Tree models were used, as well as the number of regressors (trees) employed by the model. Two, four and six trees were tested, resulting in 60 different models.

Polynomial Regression
In the polynomial regression models, the applied settings modify the polynomial degree. All polynomial degrees, from second to ninth degree, were tested. Also, the independent term is adjusted to be calculated or not. This results in 16 different models.

Bayesian Ridge Regression
Several combinations were tested to determine the best values for α1, α2, λ1 and λ2. Values of 1e-5, 1e-6 and 1e-7 were used, resulting in 12 possible combinations.

Support Vector Regression
The modified hyperparameters were the regularization coefficient, epsilon and the kernel function. The first two took the values 100, 10, 1 and 0.1, while three different kernel functions, linear, sigmoid and Radial Basis Function (RBF), were tested. This results in 48 different models.

Multilayer Perceptron
A batch size of 32 samples and 200 epochs were used to train the MLPs. The MLPs are constructed with one hidden layer and a linear function as the activation function of the output layer. The search focuses on the number of hidden neurons and their activation function. The number of hidden neurons was varied from 1 to 15 in steps of 2. The activation functions used include linear, sigmoid and hyperbolic tangent functions. This results in 24 different models.
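The number of configurations reported for each technique follows from combining all the values of its hyperparameters. The sketch below enumerates these grids with `itertools.product`; the dictionary keys are hypothetical names chosen for readability, and the Bayesian Ridge grid is omitted because the tested combinations are not enumerated in the text.

```python
from itertools import product

depths = list(range(1, 11))
criteria = ["absolute_error", "squared_error"]

# Hypothetical parameter names; the values are those listed in the text.
grids = {
    "RLS": {"fit_intercept": [True, False], "positive": [True, False]},
    "KNN": {"n_neighbors": list(range(4, 41, 2)), "weights": ["uniform", "distance"]},
    "DT": {"max_depth": depths, "criterion": criteria},
    "RF": {"max_depth": depths, "criterion": criteria, "n_trees": [2, 4, 6]},
    "PN": {"degree": list(range(2, 10)), "fit_intercept": [True, False]},
    "SVR": {
        "C": [100, 10, 1, 0.1],
        "epsilon": [100, 10, 1, 0.1],
        "kernel": ["linear", "sigmoid", "rbf"],
    },
    "MLP": {
        "hidden_neurons": list(range(1, 16, 2)),
        "activation": ["linear", "sigmoid", "tanh"],
    },
}


def expand(grid):
    """List every combination of hyperparameter values as a dict."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


counts = {name: len(expand(grid)) for name, grid in grids.items()}
print(counts)
```

Each dictionary returned by `expand` corresponds to one candidate model to be evaluated in the cross-validation.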

Model evaluation
This research uses the metrics in Table 2 to determine the prediction quality. Lower values of RMSE and MAE indicate better model performance, while SMAPE returns a value from 0 to 100%, with 0% as the best scenario. In the case of the coefficient of determination (R²), values near 1 mean good model behavior, and values near 0 indicate poor model performance. These metrics are taken into consideration in two steps. First, a comparison is made from the results obtained in a 10-fold cross-validation applied over 80% of the samples. Once the best model configuration is determined, the model is trained with that 80% of the samples, and the remaining 20% are used in the test phase to validate it.
After the first step, a Kruskal-Wallis H-test is performed to establish whether there are significant differences between the models' performance. Tukey's HSD test is performed if the null hypothesis is rejected. This research chose a 5% significance level (α = 0.05).
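For reference, the three error-based metrics with standard definitions can be sketched in plain Python as follows. The function names and sample values are illustrative; the values are made up and are not results from this work.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up power values in watts.
actual = [0.0, 100.0, 200.0, 300.0]
predicted = [10.0, 90.0, 210.0, 290.0]
print(mae(actual, predicted), rmse(actual, predicted), r2(actual, predicted))
```

MAE and RMSE are expressed in the units of the target (watts here), whereas R² is dimensionless.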

Results
This subsection presents the results achieved with the experiment setup.

Feature selection results
Figure 3 shows the correlation matrix between the meteorological variables and the power generated. As this last variable is the target value, special attention is paid to the variables correlated with it.
As the correlation matrix shows, the temperature, solar and humidity-related variables are not strongly correlated with the active power. The most significant correlations correspond to wind speed-related variables, especially wind speed (Speed), standard deviation of wind speed at 10 meters (SpeedSD10m) and maximum wind speed at 10 meters (Break10m), in that order. Considering this information, these three variables were selected as the features used to predict the active power. Figure 4 represents the resulting virtual sensor topology.

Regression techniques results
Table 3 presents the best configuration for each method. Table 4 shows the average value of the RMSE, MAE, SMAPE and R² obtained by each of the models in the cross-validation phase. The best result of each metric is marked in bold letters.
From Table 4, it is clear that 3 out of the 4 metrics, RMSE, MAE and R², show a good estimation of the active power. SMAPE is the only metric indicating a poor estimation, with large deviations from the actual value. This is caused by the large number of samples where the active power is 0. In these cases, any deviation of the prediction from the actual value yields a SMAPE of 100%, following Equation 2.
A graphical comparison is made after obtaining the best configuration of each model to predict the power generation. Figures 5a, 5b and 5c show boxplots of the MAE, RMSE and R², respectively, of each regressor.
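The behavior of SMAPE for zero-power samples can be reproduced with a short sketch. It assumes the variant of SMAPE bounded between 0 and 100%, SMAPE = (100/n) · Σ |y − ŷ| / (|y| + |ŷ|), which is consistent with the range stated in this work; treating a sample with y = ŷ = 0 as a zero-error term is an additional assumption.

```python
def smape(y_true, y_pred):
    """SMAPE in percent, bounded between 0 and 100 (assumed variant)."""
    terms = []
    for t, p in zip(y_true, y_pred):
        denom = abs(t) + abs(p)
        # Assumption: a perfect prediction of zero contributes no error.
        terms.append(0.0 if denom == 0 else abs(t - p) / denom)
    return 100.0 * sum(terms) / len(terms)


# When the actual power is 0 W, even a tiny prediction error gives 100%.
zero_power_case = smape([0.0, 0.0], [1.0, 0.001])
mixed_case = smape([0.0, 100.0], [5.0, 100.0])
print(zero_power_case, mixed_case)
```

In the second case a single zero-power sample raises the average to 50%, even though the other prediction is exact, which illustrates how this metric is distorted when many samples have zero power.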

Statistical analysis
All models perform similarly in the cross-validation phase except for the recursive least squares and Bayesian Ridge regression, which behave very differently and give a poor response. The metric chosen to perform the Kruskal-Wallis H-test was the Mean Absolute Error. This test returns a p-value of 5.37e-13, rejecting the null hypothesis and confirming significant differences between the models' responses, at least in terms of MAE. Tukey's test is performed to determine the differences between pairs, resulting in 28 different comparisons. The null hypothesis is accepted for 5 pairs and rejected for the remaining 23. For these 5 pairs, presented in Table 5, the differences between their MAE results in the cross-validation are not statistically significant, so their performance can be considered identical, at least concerning the MAE. Table 6 shows the metrics obtained by each model in the validation phase. These metrics remain similar to the values obtained in the training phase, so no overfitting occurs. Modifying the batch size and epochs in the MLP's training was necessary for a proper prediction. In this validation phase, a batch size of 18 samples and 50 epochs were used.
Figure 6 shows the actual values versus the predicted values of the KNN and DT models, the best models obtained. The forecasting is good except for some predictions that are considerably far from the actual value. Remarkably, neither of the models predicts a generated power with a negative sign.

Conclusions and future works
This research compares several linear and non-linear regression techniques, looking for the best model to predict the power generated by a small wind turbine from atmospheric variables, specifically wind-related ones. The hyperparameters of each regression technique were tuned to get the best possible prediction. Statistical tests were employed to determine whether the differences between the models were significant.
The response of the models is very similar and achieves, in general, good predictions. In most cases, they obtain a small MAE of between 30 and 40 W, while the active power can reach 2000 W. Even if all models behave similarly, KNN and DT stand out as the best methods for estimating active power, since they bring together the best results in many metrics and present little dispersion in the results, as visible in the box plots. With this in mind, the KNN-based model is preferred, as it has a more robust behavior with less dispersion and fewer outliers. This model, configured with 14 neighbors and a uniform distribution of weights, obtained in the validation phase an MAE of 32.75 W, an RMSE of 71.50 W and a coefficient of determination of 0.75. The value of SMAPE is worse, at 42.81%, but this metric is distorted by the number of samples in which the actual value is 0 W.
The proposal represents a valuable tool to supervise the wind turbine status.From indirect atmospheric variables, the accurate estimation can be compared with real active power values measured by physical sensors.Based on this, different tasks, such as anomaly detection or maintenance decision-making, can help to ensure system robustness.
In future works, restrictions can be applied to the dataset for better predictions. Wind speed limitations could be interesting: constraining this range can allow the model to adjust to a more stable zone and improve its performance. A combination of restrictions, such as power and wind speed, can be studied too. Also, other regression techniques and pre-processing steps can be applied to obtain better predictions, as well as other feature selection or extraction mechanisms not based on the correlation value.

FIGURE 3. Correlation matrix between meteorological and power generated variables.

FIGURE 4. Virtual sensor topology after correlation matrix analysis.

TABLE 2. Metrics used to evaluate the model performance.

TABLE 3. Model hyperparameters with best performance.

TABLE 6. Metrics obtained in the validation phase.