A Unique Unified Wind Speed Approach to Decision-Making for Dispersed Locations

The repercussions of high levels of environmental pollution, coupled with the low reserves and rising costs of traditional energy sources, have led to the widespread adoption of wind energy worldwide. However, the expanded use of wind energy is accompanied by major challenges for electric grid operators due to the difficulty of controlling and forecasting wind energy production. The development of methods for addressing these problems has therefore attracted the interest of numerous researchers. This paper presents an innovative method for assessing wind speed in different and widely spaced locations. The new method uses wind speed data from multiple sites as a single package that preserves the characteristics of the correlations among those sites. The Waikato Environment for Knowledge Analysis (Weka) machine learning software has been employed to support data preprocessing, clustering, classification, visualization, and feature selection, and to construct decision trees from a training set using a standard algorithm. The resultant ranking of the sites according to likely wind energy productivity facilitates better decisions about the potential for the effective operation of wind energy farms at those sites. The proposed method is anticipated to provide network operators with an understanding of the possible productivity of each site, thus facilitating their optimal management of network operations. The results are also expected to benefit investors interested in establishing profitable projects at those locations.


Introduction
Growing global interest in reducing the environmental pollution created by heavy reliance on oil derivatives for the production of electric power has motivated governments to take significant steps toward the implementation of renewable energy. One of the most important renewable energy sources is wind, with the 2019 total world capacity of wind energy estimated at 650 gigawatts [1] and the annual global increase in wind energy calculated at 20% [2]. This expansion has made wind energy technology a principal source of energy in terms of sales and technical development. In spite of these advances, this energy resource remains unreliable at high penetration rates, and increasing dependence on this technology is associated with the emergence of numerous problems for electrical system operators. Examples of these challenges are the substantial changes in wind production arising from the random behavior of wind speeds, as well as the possibility of overlooking a site that might in fact be considered a good choice for a specific period. Many previous studies also relied on the assumption that an appropriate distribution for all sites is a Weibull distribution. Since such an assumption is neither accurate nor valid for all sites, the results could be over-approximations, according to [16][17][18]. To the best of our knowledge, no study has taken into account either wind speed data collected for different, distanced locations or the processing of those data as a single package to maintain the characteristics of the correlations among locations and thus to provide more accurate and detailed standard measures of wind speed productivity at those locations.
Addressing this point represents the core contribution of the work presented in this paper. Data mining techniques have recently been used in numerous applications because of the benefits these techniques offer with respect to developing models and making decisions.
Several studies have employed artificial intelligence techniques for renewable energy systems. For example, artificial neural networks are used in [19] to characterize PV modules. Data mining procedures that include support vector machines and fuzzy logic have also been applied in several studies. In [20], a new methodology combining a Gaussian-kernel support vector machine and an adaptive fuzzy inference system is developed; this methodology extracts the fuzzy rules directly from the training data for use in the testing stage. In [21,22], EEG signals are analyzed using SVM, ANN, naïve Bayes, and decision trees for epilepsy detection. In [23], the authors used the decision tree technique to detect adverse drug reactions, and the system was optimized using a genetic algorithm. An efficient feature selection method was developed in [24] for enhancing Arabic text classification. In [25][26][27], texture classification techniques are developed based on independent component analysis and a naïve Bayes classifier.
In this study, a decision tree algorithm is used and the major contributions of this study in comparison to existing studies are as follows:

1. A unique and unified method for predicting wind speeds at diversified locations in the KSA is proposed. The proposed model enables the examination of deviations and correlations of wind speeds at different locations.

2. A model is developed that handles and examines an extensive range of data for a variety of sites. In addition, conclusions about the characteristics of these data can be drawn using the least possible number of classifications, which facilitates understanding of the data and expedites their use. The goal was to help decision makers arrive at quick, accurate, and informed decisions.

3. Finally, the assessed locations can be ranked by capability to enable system operators to ascertain in advance the monthly productivity of each site so that they can implement appropriate planning and operating actions.

System Design and Methodology
This section provides details about the developed prediction system, which is based on a decision tree algorithm. Numerous decision tree algorithms are currently available, including random forests, random trees, J48, and classification and regression trees (CART). A decision tree algorithm employs training data to build a tree model that is used for classification purposes. The developed classification algorithm involves three phases: data gathering, data preprocessing, and learning and classification. In the data-gathering phase, the training and test sets are collected from wind station databases. The second phase involves preprocessing of the data, including outlier detection and elimination, missing data treatment, and averaging. In the learning and classification phase, the goal is to develop an intelligent decision mechanism. A test set is then applied to determine the accuracy of the developed model.

Data-Gathering Phase
The five locations whose wind speed data were examined in this study were carefully selected to include all regions of the KSA [28]. Five sites were chosen to be representative of each region: center, east, west, south, and north. The selection corresponds to the operational divisions of the Saudi Arabian electrical system. Figure 1 shows the sites where the data were collected.

Table 1 provides a statistical summary of the data collected for each site. These statistics are a collection of indices that provide meaningful information regarding the location and variability of the data. To facilitate their interpretation, brief definitions of some of the statistics are given here [29]. The most common indicator of the central tendency of a random variable is the mean, which represents the average of the data points. For the selected sites, the means are about 3 m/s to 4 m/s, with the exception of the east region, where 1.9 m/s is the recorded mean. The standard error (SE) indicates how close the mean of the sampled data is to the true population mean. An SE of 0.05 or less implies that the sample data are quite similar to those for the whole population, with a confidence level of 95%. As can be observed from the results, the SE values for all sites are less than 5%, so the data sample for each site is large enough to represent the true population. The median is another measure of central tendency, and the mode refers to the most frequently occurring value in the data. Standard deviation and variance denote the spread of the data distribution. Kurtosis identifies whether the tails of a given distribution contain extreme values. Skewness measures the symmetry of the distribution and differentiates extreme values in one tail versus the other. The minimum is the smallest value in the data set, while the maximum is the largest. The sum is the summation of the wind speeds over the entire data set, and the count is the number of items in the data set.
The results listed in Table 1 reveal noticeable differences among the statistical values associated with different sites. These discrepancies were expected due to the divergent distances between the sites and the diverse nature of the local weather. The data are part of the Renewable Resource Monitoring and Mapping (RRMM) program prepared by King Abdullah City for Atomic and Renewable Energy (KACARE). KACARE monitored and recorded the wind speed data at stations installed across the Kingdom of Saudi Arabia at a height of 3 m. Table 2 provides an example of data for one of the five sites. The sample size is associated with the amount of information provided and determines the precision, or level of confidence, of the desired estimate. A wind speed estimate always has an associated level of uncertainty, which depends on the underlying variability of the data as well as the sample size: the smaller the sample size, the greater the uncertainty in the estimate; conversely, a larger sample size provides more information and thus reduces the uncertainty. In this study, the sample size at each selected site ranges from 19,000 to 25,000 data points. This large sample size was collected to reduce the uncertainty associated with the estimates and achieve reasonable results. The steps involved in the proposed model through the Weka tool consider several data mining concepts, as follows. First, Weka allows a preprocessing step in which outliers and irrelevant data are detected by cleaning and clustering the raw data using the k-means technique. In addition, the data mining techniques cater to uncertainty; this is reflected in the decision tree methodology, which applies an impurity measure to decide the optimal split from the root node and for subsequent splits.
The Gini impurity measures the frequency with which an element of the dataset would be mislabeled if it were labeled randomly. Entropy is another impurity measure, in which the optimal split is the one whose resulting partitions have the least entropy. A subset of the combined database is shown in Table 3. The data were collected from 9 January 2013 to 31 December 2016, and the combined database consists of 34,872 records; the information in Table 3 is only a small subset of the available database. The zero irradiances recorded for the north region in this table at 4 and 5 a.m. are normal, as the sun has not yet risen at those hours.
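As a minimal illustration of these two impurity measures, both can be computed directly from a list of class labels; the label values below are illustrative and are not drawn from the actual dataset:

```python
import math
from collections import Counter

def gini_impurity(labels):
    """Probability that a randomly drawn element is mislabeled when
    labels are assigned at random according to the class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy (in bits) of the class label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

labels = ["H", "H", "L", "M"]   # illustrative wind speed rank labels
print(gini_impurity(labels))    # 0.625
print(entropy(labels))          # 1.5
```

A pure node (all labels identical) scores zero under both measures, which is why either can drive the split selection described later.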

Data Preprocessing Phase
Data preprocessing includes data cleaning and missing data treatment. In this phase, information not needed for the wind speed model, such as the irradiance, latitude, and longitude, is removed from the database. A wind speed value missing for a specific date is then replaced by the average value of the wind speeds for that day [30][31][32][33]. The date itself is eliminated and simply replaced by the corresponding month; e.g., 29/05/2013 10:00:00 is replaced by May, as shown in Table 4. The combined database is then rearranged to add an output label to a new set of input attributes, defined as indicated in Table 5: month, center wind speed, south wind speed, east wind speed, north wind speed, and west wind speed. The output attribute consists of multi-labeled data: case 1 to case 120. Since the number of locations is five, the number of possible output cases is 5! = 120. With the use of an association rule algorithm [34][35][36][37], the number of possible cases can be decreased to eight. The association rule algorithm captures the correlation between wind speeds in different areas.
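A sketch of these preprocessing steps using pandas follows; the column names and values are hypothetical, since the actual database schema is not given here:

```python
import pandas as pd

# Hypothetical schema and values; the actual KACARE database may differ.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2013-05-29 09:00", "2013-05-29 10:00", "2013-05-29 11:00"]),
    "irradiance": [810.0, 845.0, 870.0],
    "latitude": [24.7, 24.7, 24.7],
    "longitude": [46.7, 46.7, 46.7],
    "center_ws": [3.2, None, 3.8],   # one missing wind speed reading
})

# Remove attributes not needed for the wind speed model
df = df.drop(columns=["irradiance", "latitude", "longitude"])

# Replace a missing wind speed with the average for that day
day = df["timestamp"].dt.date
df["center_ws"] = df.groupby(day)["center_ws"].transform(
    lambda s: s.fillna(s.mean()))

# Replace the full date with the month, e.g. 29/05/2013 10:00:00 -> "May"
df["month"] = df["timestamp"].dt.strftime("%B")
df = df.drop(columns=["timestamp"])
print(df)
```

The missing 10:00 reading is filled with the daily mean of the remaining values, (3.2 + 3.8) / 2 = 3.5 m/s, and only the month and wind speed columns survive into the rearranged database.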
The association algorithm can be summarized in the following steps:
Step 1: Generate candidate rules of the form "if A then B" from the labeled data.
Step 2: Calculate the confidence of each generated rule, i.e., if A then B, using:
Confidence = (number of records containing both A and B) / (number of records containing A)
Step 3: Calculate the support of each generated rule, i.e., if A then B, using:
Support = (number of records containing both A and B) / (total number of records)
Step 4: Check whether the support is less than a pre-defined threshold, i.e., minsup.
Step 5: Check whether the confidence is less than a pre-defined threshold, i.e., minconf.
Step 6: Prune rules that fail the minsup and minconf thresholds.
The wind speed of each location is labeled using a rank-based system. The developed ranking system distributes wind speeds evenly, measuring them only relative to a given location rather than according to the real value of any given speed. The ranking-based system includes five labels that identify the level of the wind speed: very high (VH), high (H), medium (M), low (L), and very low (VL). The database resulting after the labels have been assigned based on the wind speed ranking is shown in Table 6. To minimize the number of output attributes, an association rule algorithm is applied for analyzing all of the relations between the cases. Table 7 shows the resulting cases and the corresponding locations for the rules whose support and confidence levels are greater than a given minimal support threshold (minsup = 0.01) and a given minimal confidence threshold (minconf = 0.5). Table 8 provides a sample of association rules with their support and confidence levels; the table shows the minimum number of cases that can be achieved using the association algorithm with a unity confidence level. Figure 2 summarizes all of the steps described above for the data-gathering and preprocessing stages.
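The support and confidence tests in the steps above can be sketched as follows, with each record modeled as a set of (location, label) pairs; the records and the rule shown are illustrative, not taken from the actual database:

```python
def support(records, antecedent, consequent):
    """Fraction of records containing both A and B."""
    both = sum(1 for r in records if antecedent <= r and consequent <= r)
    return both / len(records)

def confidence(records, antecedent, consequent):
    """Of the records containing A, the fraction also containing B."""
    with_a = [r for r in records if antecedent <= r]
    both = sum(1 for r in with_a if consequent <= r)
    return both / len(with_a) if with_a else 0.0

# Each record is a set of (location, label) pairs; values are illustrative
records = [
    {("east", "VL"), ("center", "H")},
    {("east", "VL"), ("center", "H")},
    {("east", "VL"), ("center", "M")},
    {("east", "L"), ("center", "H")},
]
rule_a, rule_b = {("east", "VL")}, {("center", "H")}
minsup, minconf = 0.01, 0.5
s = support(records, rule_a, rule_b)      # 0.5
c = confidence(records, rule_a, rule_b)   # 2/3
keep = s >= minsup and c >= minconf       # prune the rule if either fails
print(s, c, keep)
```

Rules that fall below either threshold are pruned, which is how the 120 possible output cases are reduced to the eight that survive.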
Figure 3 displays a flowchart of the developed classification algorithm, which governs the processing of the data through three stages: training, testing, and validation. First, the training data are applied to the decision tree algorithm to obtain the initial model. For each iteration, the accuracy and precision are then calculated as a means of achieving the optimal model; the test data are applied so that the performance and efficiency of the model can be verified; and in the final step, the remaining verification data are employed to ensure that the results produced by the model have a high degree of accuracy and precision.

Learning and Classification Phase
A decision tree partitions the input space of the dataset into mutually exclusive regions and assigns each region a label. The tree begins with a root node and ends with leaf nodes [23]. Multiple branches are formed between the root and the leaf nodes. The algorithm repeatedly splits the data into multiple regions, and each region is divided into smaller parts; splitting continues until the terminal nodes become leaf nodes. Each split is chosen on the basis of an impurity measure. Two common impurity measures are the Gini index and entropy. In this paper, entropy is used as the impurity measure because it also evaluates the homogeneity of the partitioned nodes. The following steps summarize the decision tree algorithm.
Step 1: The entropy of the root node with n classes is calculated as
E(root) = − Σ_{i=1}^{n} p_i log2(p_i)
where p_i is the fraction of records that belong to class i at the node.
Step 2: The entropy of each partition i with J subclasses is calculated as
E(partition_i) = − Σ_{j=1}^{J} p_j log2(p_j)
Step 3: The branch entropy is calculated from the individual k partition entropies as
E(branch) = Σ_{i=1}^{k} (n_i / n) E(partition_i)
where n_i is the number of records at partition i, n is the number of records at the branch, and E is the entropy.
Step 4: The GAIN split is used to decide which partition is best; the partition that produces the greatest reduction in entropy is chosen. The GAIN split is
GAIN_split = E(root) − E(branch)
If all input attributes are used, the resulting decision tree induction model is shown in Figure 4. If the prediction order is requested for a specific month and the wind speeds are unavailable at that moment, the decision tree induction model shown in Figure 5 is used. This model is based on a single input attribute: "month".
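The entropy and GAIN-split computations in Steps 1-4 can be sketched as follows; the class labels and candidate splits are illustrative:

```python
import math
from collections import Counter

def entropy(labels):
    """Steps 1-2: E = -sum_i p_i * log2(p_i) over the classes at a node."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_split(parent_labels, partitions):
    """Steps 3-4: GAIN_split = E(parent) - sum_i (n_i / n) * E(partition_i)."""
    n = len(parent_labels)
    branch_entropy = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent_labels) - branch_entropy

parent = ["case 1", "case 1", "case 2", "case 2"]
# Candidate split A separates the two classes perfectly; split B does not
split_a = [["case 1", "case 1"], ["case 2", "case 2"]]
split_b = [["case 1", "case 2"], ["case 1", "case 2"]]
print(gain_split(parent, split_a))  # 1.0 -> largest reduction, so chosen
print(gain_split(parent, split_b))  # 0.0
```

The attribute whose split yields the largest gain is placed at the node, and the procedure recurses on each partition until the leaves are pure.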
A new model, Model 2, is implemented based on the output of the previous model, Model 1, as shown in Figure 6. The implementation involves a comparison of the output for the five cases generated from the first model with that of the eight cases from the original training data. The output from these five cases, along with the output from the original cases, is then used as input to a similarity algorithm. Next, the similarity algorithm measures the similarity score between the five cases and each case from the original data; i.e., Case 8 is similar to Case 4, Case 7 is similar to Case 3, and Case 6 is similar to Case 5. The algorithm relies on edit distance, a technique for quantifying how dissimilar two strings (e.g., words) are to one another based on a count of the minimum number of operations required to transform the first string into the second. The edit distance between two cases for the five locations is the minimum number of operations required for transforming one case into another. For example, the edit distance between "case 1 case 2 case 3 case 4 case 5" and "case 1 case 3 case 2 case 4 case 5" is two. A flowchart of the similarity algorithm is shown in Figure 7.
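A minimal implementation of the edit distance between two case sequences, reproducing the example above, is sketched below; this is the standard Levenshtein dynamic-programming formulation, and the paper's exact operation set may differ:

```python
def edit_distance(seq_a, seq_b):
    """Minimum number of insertions, deletions, and substitutions
    needed to transform one sequence of case labels into another."""
    m, n = len(seq_a), len(seq_b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of seq_a's first i items
    for j in range(n + 1):
        d[0][j] = j          # insert all of seq_b's first j items
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

a = ["case 1", "case 2", "case 3", "case 4", "case 5"]
b = ["case 1", "case 3", "case 2", "case 4", "case 5"]
print(edit_distance(a, b))  # 2, matching the example in the text
```

Swapping the labels at the second and third positions requires two substitutions, hence the distance of two.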
The resulting similarity pairs are employed for reprocessing the original training data through the replacement of the original cases with the similar cases, as shown in Table 9 compared with Table 6. In the final step, the resulting training data are applied for training Model 2, with the use of the decision tree as previously performed for developing Model 1. The accuracy of Model 2 is thereby increased to 100%.

Experiments and Results
For this study, Waikato Environment for Knowledge Analysis (Weka) software was employed [38] for constructing decision trees according to the training set, using the standard J48 algorithm [39][40][41][42]. This algorithm has been selected as one of the top 10 algorithms in data mining [43]. Java was used as the development language with J2SDK version 1.6.0_22. Weka version 3.8.4 was employed for the experimental component of the model development.
Weka is first used for data preprocessing before the machine learning algorithms are applied. The wind speed data for the selected sites are loaded from CSV files by clicking the "Open file" button and selecting the data file. The loaded dataset is then partitioned using cross-validation, which randomly divides the data into k subsamples for training and testing; the number entered in the Folds field specifies how many folds the dataset is divided into. The J48 classifier is then used to create a pruned decision tree. The Classifier Model panel illustrates the model as a tree and gives information about it, such as the number of leaves and the size of the tree. The stratified cross-validation section reports the error rates, indicating how successful the model is. By right-clicking and selecting "Visualize tree", the developed model's tree can be displayed.
The performance measurements for this work were recall, precision, the classifier F1-score, and accuracy. Examining the data for accuracy and precision establishes the credibility of the results. Accuracy refers to how closely the measurements match the desired "true" value, whereas precision indicates how well repeated measurements agree with one another. For ordering decisions about wind speed locations, it is important both that the values be close, i.e., a high level of precision, and that the decisions be correct, i.e., a high degree of accuracy. Accuracy and precision are defined in (5) and (6):
Accuracy = (TP + TN) / (TP + TN + FP + FN) (5)
Precision (P) = TP / (TP + FP) (6)
where TP is true positive, TN is true negative, FP is false positive, and FN is false negative. True positives and true negatives are the outcomes for which the developed model correctly predicts the cases; by contrast, false positives and false negatives are the outcomes for which the model incorrectly predicts the cases.
Recall (R) is the ratio of the accurately predicted data to the total relevant data. Its formula is shown in (7):
Recall (R) = TP / (TP + FN) (7)
The classifier F1-score is calculated as the harmonic mean of precision and recall:
F1 = 2PR / (P + R)
where P is the precision and R is the recall. The performance measurement results are listed in Table 10. Measurements from another performance indicator established with the use of a confusion matrix are presented in Table 11. The confusion matrix was built based on the test data, and a confusion matrix was constructed for each class in the form shown in Table 12. Because of the limited number of training cases, care must be exercised when reserving training samples for testing purposes. Cross-validation was employed for testing, checking, and verifying the generalizability of the model. Any trained model has a tendency to overfit, and cross-validation was applied as a means of avoiding this effect. One way to improve the reliability of a system is to reserve a small portion of the training data for validating the model, since this approach provides an idea of the ability of the model to predict previously unseen data. K-fold cross-validation is a technique commonly used for this purpose. In the 10-fold version, the training set is randomly split into 10 groups of approximately the same size. The classifier is then trained using eight subsets; one of the two remaining subsets is used for validation and the last for testing. This process is repeated until every fold, one by one, has had an opportunity to be the assigned test set. This technique establishes the generalizability of the model, especially when limited data make it difficult to partition the data into separate test and training sets. Table 13 shows the average degree of accuracy for 2-fold, 4-fold, 6-fold, and 8-fold cross-validation and for the 10-fold cross-validation used in this paper.
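The metrics defined in (5)-(7) and the F1-score can be computed directly from the confusion counts; the counts below are illustrative and are not the paper's results:

```python
def accuracy(tp, tn, fp, fn):
    """Eq. (5): fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Eq. (6): fraction of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Eq. (7): fraction of actual positives that are found."""
    return tp / (tp + fn)

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Illustrative per-class counts from a confusion matrix
tp, tn, fp, fn = 90, 85, 10, 15
p, r = precision(tp, fp), recall(tp, fn)
print(accuracy(tp, tn, fp, fn))  # 0.875
print(round(f1_score(p, r), 3))
```

For a multi-class problem such as the eight output cases, these quantities are computed per class from each class's confusion matrix (as in Table 12) and then averaged.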
In this research, a unique system was developed to rank locations according to wind speed. The process was carried out in three stages: data collection, data processing, and design. In the first stage, data are collected from different places in the center, north, south, east, and west of the Kingdom. These data contain the wind speed together with additional information such as the longitude and latitude of the location and the date of the collected samples. The data are gathered in a central database that contains all the information drawn from the databases spread over the different places. In the second stage (data processing), information that is not useful for this research, such as longitude and latitude, is discarded, and the date is replaced by the month. The central database is then rearranged, and the number of cases is reduced by using association rules (a well-known method of finding relationships), which is done by studying all cases and their relationships to each other. The developed approach can be applied to other places and other databases and, to the best of our knowledge, has not previously appeared in the literature. Machine learning methods depend on a set of algorithms that are applied to data to build models that help in making decisions. This model is not limited to these data; it can serve as a solid foundation for addressing similar problems in different areas. Other factors, such as the direction of the wind and the maximum and minimum wind speed per day, are important and might serve different applications. In this paper, however, the focus was on the wind speed, with the specific goal of providing network operators with an understanding of the possible productivity of each wind site, thus facilitating the optimal management and installation of wind plants and network operations. These other factors open the door for substantial future work.
The wind direction especially will play an important role in determining the place of the wind plants and the layout of wind turbines.
The proposed model shows great promise in that two locations are sufficient for obtaining the order of preference of the locations. For example, if it is known only that the wind speed in the east region is below 3.05 m/s, then the scenario follows Case 2. Once the case is known, the order of the wind speed values at all locations can be determined. If the wind speed in the east region is greater than 3.5 m/s but less than 3.72 m/s, the status of the wind speed at the other locations can be extracted from the Case 4 scenario. If the wind speed in the east region is greater than 3.77 m/s, the wind speed status at the center location determines whether the scenario follows Case 6 or Case 7. This feature of the proposed model saves the time and effort that would otherwise be required for predicting the wind speed at multiple locations, and the model can thus be very helpful to system operators who need an easy, quick, and accurate method of determining the status of the wind speeds at different locations.
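The decision logic described above can be sketched as a few threshold tests; the thresholds come from the text, while the function name and the handling of the unstated intermediate ranges are assumptions:

```python
def order_from_east_speed(east_ws, center_follows_case_6=None):
    """Hypothetical helper reproducing the thresholds quoted in the text.
    The gaps between thresholds (e.g. 3.05-3.5 m/s) are handled by other
    branches of the learned tree that are not reproduced here."""
    if east_ws < 3.05:
        return "Case 2"
    if 3.5 < east_ws < 3.72:
        return "Case 4"
    if east_ws > 3.77:
        # The center location's wind speed decides between Cases 6 and 7
        return "Case 6" if center_follows_case_6 else "Case 7"
    return "resolved by another branch of the tree"

print(order_from_east_speed(2.9))  # Case 2
print(order_from_east_speed(3.6))  # Case 4
```

Each returned case fixes the full ranking of all five locations, which is what lets an operator recover the complete order from one or two measured sites.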

Conclusions
This paper has presented a machine learning-based decision-making method for the assessment of potential wind speed productivity in different locations. To preserve the characteristics of the correlations among these sites, the new method employs wind speed data from multiple sites as a single package. Machine learning using Weka software is then employed to test the correlations among the sites and to rank the sites into different cases, with wind speed as the primary classification factor for prioritizing the sites in order. The implementation of training tests on large data sets improves the prediction of appropriate locations for wind farms. Using real data, the decision model has been constructed, tested, and verified. The data are part of the Renewable Resource Monitoring and Mapping (RRMM) Program prepared by King Abdullah City for Atomic and Renewable Energy (KACARE), which monitored and recorded the wind speed data at stations installed across the Kingdom of Saudi Arabia at a height of 3 m. Ten-fold cross-validation was used in the experimental part. The proposed model shows great promise in that information about two locations is sufficient for obtaining the order of the remaining locations. The developed model shows high accuracy (up to 95.26%) on the test data, and the final performance of Model 1 has been improved by developing Model 2, whose accuracy increased to 100%. Electric network planners could use the proposed model as a means of enhancing their ability to conduct feasibility studies for plans to establish wind farm projects at distanced locations. A system operator could also use this method for assessing likely wind power productivity at each site so that network operational activities can be managed effectively. The results of this study also offer electricity market investors helpful input for making appropriate investment decisions.

Data Availability Statement:
The data that support the findings of this study are available from the corresponding author.