Power Data Preprocessing Method of Mountain Wind Farm Based on POT-DBSCAN

Because wind speed and wind direction change frequently, wind turbine (WT) power prediction based on traditional data preprocessing methods has low accuracy. This paper proposes a data preprocessing method that combines POT with DBSCAN (POT-DBSCAN) to improve the prediction efficiency of the wind power prediction model. Firstly, using the data of the WT under normal operating conditions, the power prediction model of the WT is established based on the Particle Swarm Optimization (PSO) algorithm combined with the BP neural network (PSO-BP). Secondly, the wind-power data obtained from the supervisory control and data acquisition (SCADA) system are preprocessed by the POT-DBSCAN method. Then, the power prediction on the preprocessed data is carried out by the PSO-BP model. Finally, the necessity of preprocessing is verified by the error indexes. The case analysis shows that the prediction result after POT-DBSCAN preprocessing is better than that after the Quartile method. Therefore, the accuracy of the data and of the prediction model can be improved by using this method.

The operation state of a WT is mainly characterized by its power output, and the power curve reflects the performance characteristics and operational features of the WT. According to the IEC 61400-12 Standard, the power curve of a WT is the curve relating output power to wind speed, and it is often used to evaluate the operational performance of the WT [8,9]. In actual operation, the wind speed-power data contain a large amount of abnormal data [10]. If the measured wind speed-power data are analyzed without processing these abnormal data, the results will be strongly affected [11].
Traditional data cleaning methods, which have low applicability, fall mainly into two categories. The first category is based on mathematical statistics. Shen et al. [12] used the Quartile method within given wind speed intervals to determine the identification boundary of out-of-limit wind-power data, and then identified and eliminated the out-of-limit abnormal data in the actual data set. Kusiak et al. [13] established a nonlinear model of the power curve to identify and filter abnormal data through the residual method and the control chart. Yue et al. [14] proposed the "3σ rule" to detect and filter abnormal data: an error outside 3σ is an abnormal error, while an error inside 3σ is a normal error. However, the power probability density curve has multiple peaks, so the accuracy of this rule is not high. Zhao et al. [15] used the 3-sigma principle, the Hampel identifier and boxplot rules to filter the data, and data that did not conform to the rules were determined to be abnormal. Zheng et al. [16] proposed to use the Local Outlier Factor (LOF) algorithm to detect abnormal data, calculating the relative density near the power curve to filter abnormal data. The second category preprocesses data with data mining and machine learning methods. For example, the Fuzzy Inference System (FIS) [17], the Deep Neural Network (DNN) [18] and the Support Vector Machine (SVM) [19] have been widely used. Schlechtingen et al. [20] proposed to build an abnormal data identification model based on a neural network and the K-nearest-neighbor method, and to filter abnormal data by checking the data consistency. Wu et al. [21] proposed to use the PFCM clustering model to identify and filter abnormal data. They compared FCM and PCM with PFCM, and concluded that the PFCM model recognizes abnormal data better.
Although the above methods, such as machine learning, can obtain more accurate data, they do not take into account the impact of the special working environment on the data. Meanwhile, because machine learning starts from the characteristics and structure of the data and mines their deep, essential features, it has been widely used in WT power prediction in recent years, with common methods including SVM [22], neural networks [23] and gray prediction [24]. Among them, the artificial neural network has strong nonlinear fitting, adaptation and self-learning capabilities, and it is especially suitable for predicting wind power. For example, Sun et al. proposed the application of the GA-ANN model in wind-power prediction [25]. Peng et al. [26] proposed a method for wind-power prediction based on an artificial neural network and a hybrid strategy. Although directly using a DNN to predict the power can give more accurate results, searching for the optimal parameters of the network model suffers from high calculation cost and slow training speed, which makes it unsuitable for nonlinear systems with many parameters.
To solve the above problems, this paper proposes a WT power prediction method based on POT-DBSCAN preprocessing. The method not only obtains more accurate data, but also obtains more accurate prediction results by the PSO-BP algorithm, providing an effective early warning research method for the safe and reliable operation of WTs. This research can help the relevant departments of power system to accurately assess the risk of the power grid operation, formulate reasonable power generation plans, effectively reduce the operation cost of the power grid, and greatly promote the development of green energy.

Influence of Wind-Power Data
The WT converts the changing wind energy into mechanical energy, then transforms the mechanical energy into electric energy and delivers it to the power network. When the wind speed is between the cut-in wind speed and the rated wind speed, the wind power can be evaluated by the following equation [27]:

$$P = \frac{1}{2} \rho_0 \pi R^2 v^3 C_p$$

where $\rho_0$ and $R$ are the air density and the rotor radius respectively, $v$ is the wind speed, and $C_p$ is the power coefficient.
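As a quick numerical illustration of the relation P = ½·ρ·π·R²·v³·C_p, the sketch below evaluates the power at a few wind speeds. The rotor radius of 48 m follows from the 96 m diameter reported later in the paper; the air density (1.225 kg/m³, a sea-level value that is likely too high for a mountain site) and the power coefficient (0.35) are assumptions chosen only for illustration.

```python
import math

def wind_power_kw(v, rho=1.225, radius=48.0, cp=0.35):
    """P = 0.5 * rho * pi * R^2 * v^3 * Cp, returned in kW.

    rho and cp are illustrative assumptions, not values from the paper.
    """
    return 0.5 * rho * math.pi * radius ** 2 * v ** 3 * cp / 1000.0

# Valid only between the cut-in (3 m/s) and rated (11 m/s) wind speeds.
for v in (4.0, 8.0, 11.0):
    print(f"v = {v:4.1f} m/s -> P = {wind_power_kw(v):7.1f} kW")
```

Because the power grows with the cube of the wind speed, doubling the wind speed multiplies the power by eight, which is why short wind speed ramps have such a strong effect on averaged power values.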
As shown in Fig. 1, obviously abnormal data exist in the actual measurement data. According to the distribution characteristics of the abnormal data, the operation data of a WT can be divided into outlier data, power limit data, deviation cluster data and abnormal stoppage data [28]. The outlier data are discrete and isolated in the scatter diagram and lie far from the power curve; they result from large data errors caused by sensor anomalies and failures. The deviation cluster data are distributed on the power curve with high density over a small range, and are mainly caused by electromagnetic interference or by long-lasting failures of computer information processing and storage.
Some abnormal data may also result from sudden changes of wind speed, which produce steep rising or falling edges in the wind speed. The 10-min average compression processing then causes the data to fluctuate along, and deviate from, the power curve, which makes the data inaccurate. Data from a period of rising wind speed are selected, and the power curve is drawn with the Bin method of the IEC 61400 Standard, as shown in Fig. 2. The figure shows that some power points obtained by the 10-min average compression processing deviate from the power curve. Take the red point in the figure as an example: as the wind speed suddenly increases, the power also increases. Because the 10-min interval contains many samples with small power values, the calculated average power is pulled down and falls below the power curve. Therefore, the data obtained by the 10-min average compression processing are not accurate during sudden changes of wind speed. With the POT method, the power value can be corrected by recomputing the 10-min average from only the high-frequency data that lie within a threshold range around the original 10-min average. Therefore, this paper proposes the POT-DBSCAN data preprocessing method.

Wind-Power Data Preprocessing Methods
The data collected by the SCADA system also contain abnormal data caused by complex environmental changes and sensor failures. Therefore, the data must be preprocessed before the WT units are modeled and analyzed. In this paper, the POT-DBSCAN method is used to preprocess the data.

POT Method
Let $X_1, X_2, \cdots, X_n$ be independent and identically distributed random variables that follow an unknown distribution $F(x)$, and let $\mu$ be a sufficiently large threshold. Let $N_\mu$ be the number of samples exceeding the threshold. When $X_i > \mu$ $(i = 1, 2, \cdots, n)$, $X_i$ is called a supra-threshold value, and the excess is $Y_i = X_i - \mu$. The distribution of the excess of the random variable $X_i$ is called the conditional excess distribution function:

$$F_\mu(y) = P(X - \mu \le y \mid X > \mu) = \frac{F(y + \mu) - F(\mu)}{1 - F(\mu)}, \quad y > 0$$

The distribution of the excesses is given by the Pickands-Balkema-de Haan theorem. For a sufficiently large threshold, the distribution function of the excesses approximately follows the Generalized Pareto Distribution (GPD). The GPD of the excess distribution is approximately expressed as:

$$G_{\xi,\beta}(y) = \begin{cases} 1 - \left(1 + \xi \dfrac{y}{\beta}\right)^{-1/\xi}, & \xi \ne 0 \\ 1 - e^{-y/\beta}, & \xi = 0 \end{cases}$$

When $\xi \ge 0$, $y > 0$, and when $\xi < 0$, $0 < y < -\beta/\xi$. Here $\beta$ and $\xi$ are called the scale parameter and the shape parameter, respectively. Different values of $\xi$ give different tail thicknesses: the larger $\xi$ is, the thicker the tail, and vice versa.
When the threshold is determined, $F(\mu)$ is estimated from the historical data and is approximately represented by $(n - N_\mu)/n$, so the overall distribution function can be written as:

$$F(x) = 1 - \frac{N_\mu}{n}\left(1 + \xi\,\frac{x - \mu}{\beta}\right)^{-1/\xi}, \quad x > \mu$$
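The POT tail model can be sketched in a few lines of standard-library Python. The function names `gpd_cdf` and `tail_cdf` are hypothetical, and the parameter values used in the example call are made up; in practice the shape and scale parameters would be fitted to the observed excesses. The sketch combines the GPD for the excess with the empirical estimate F(threshold) ≈ (n − N)/n, where N is the number of samples above the threshold.

```python
import math

def gpd_cdf(y, xi, beta):
    """Generalized Pareto CDF of an excess y (shape xi, scale beta > 0)."""
    if y <= 0:
        return 0.0
    if abs(xi) < 1e-12:                  # xi = 0: exponential limit
        return 1.0 - math.exp(-y / beta)
    base = 1.0 + xi * y / beta
    if base <= 0:                        # at or past the endpoint -beta/xi (xi < 0)
        return 1.0
    return 1.0 - base ** (-1.0 / xi)

def tail_cdf(x, mu, xi, beta, n, n_exceed):
    """POT estimate of F(x) for x >= mu, with F(mu) ~ (n - n_exceed) / n."""
    f_mu = (n - n_exceed) / n
    return f_mu + (1.0 - f_mu) * gpd_cdf(x - mu, xi, beta)

# Illustrative values only: 600 samples, 30 of them above the threshold mu = 10.
print(tail_cdf(12.0, mu=10.0, xi=0.1, beta=1.5, n=600, n_exceed=30))
```

At the threshold itself the estimate reduces to (n − N)/n, and it rises monotonically towards 1 as x moves further into the tail.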

DBSCAN Clustering Method
DBSCAN differs from other clustering algorithms in that it forms clusters according to the density distribution of the data set. It can not only deal with noise effectively but also cluster data of arbitrary shape. The core principle of the DBSCAN clustering algorithm is to examine every point in the database: when the density of points in the neighborhood of a point is greater than a set threshold, the point is added to the adjacent cluster, and the clustering is repeated until no more points can be added. As a typical density-based clustering method, DBSCAN can therefore identify a cluster by setting a density threshold. The algorithm has two key parameters, Eps and MinPts. Eps is the radius of the neighborhood of a point, and MinPts is the minimum number of neighbors within that radius. Following references [29,30], MinPts is set to 4 in this study, and Eps is calculated using the following equation:

$$Eps = \left( \frac{V \cdot MinPts \cdot \Gamma(n/2 + 1)}{m \sqrt{\pi^n}} \right)^{1/n}$$

where $m$ denotes the number of objects in the experimental data set, $n$ is the dimensionality of the experimental space, $\Gamma(\cdot)$ is the gamma function (the generalization of the factorial), and $V$ is the volume of the experimental space formed by the $m$ objects:

$$V = \prod_{i=1}^{n} \left( \max(x_i) - \min(x_i) \right)$$

where $\max(\cdot)$ is the largest value function, $\min(\cdot)$ is the smallest value function, and $x_i$ is the $i$-th column of the $m$-by-$n$ experimental data matrix.
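The following is a minimal standard-library sketch of the two pieces described above: a heuristic Eps estimate and the density-based clustering itself. `estimate_eps` assumes the common reading of the Eps rule, namely the radius of an n-dimensional ball that would hold MinPts points on average if the m objects were spread uniformly over the bounding volume V; the function names are hypothetical, and this is not the authors' implementation. The brute-force neighbor search is quadratic in the number of points, so it is meant for illustration rather than for a full year of SCADA data.

```python
import math

def estimate_eps(data, min_pts=4):
    """Heuristic Eps from the bounding-box volume of an m-by-n data set."""
    m, n = len(data), len(data[0])
    volume = 1.0
    for j in range(n):
        col = [row[j] for row in data]
        volume *= max(col) - min(col)
    return (volume * min_pts * math.gamma(0.5 * n + 1)
            / (m * math.sqrt(math.pi ** n))) ** (1.0 / n)

def dbscan(data, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(data)

    def neighbors(i):
        # Brute-force search; the point itself counts as its own neighbor.
        return [j for j in range(len(data)) if math.dist(data[i], data[j]) <= eps]

    cluster = -1
    for i in range(len(data)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = noise            # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:     # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels

# A 3-by-3 grid of points plus one isolated point far away.
points = [(float(x), float(y)) for x in range(3) for y in range(3)] + [(10.0, 10.0)]
print(dbscan(points, eps=1.5, min_pts=4))
```

Since wind speed (m/s) and power (kW) have very different ranges, scaling both columns to a comparable range before clustering is a sensible precaution; whether the paper does so is not stated.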

Wind-Power Data Preprocessing Process
The abnormal data of the WT come from various sources. If SCADA data containing abnormal data are used directly for WT power prediction, large errors will be produced. Because the apparent operation state of a WT is easily distorted by abnormal data, the data must be preprocessed before analysis to avoid their influence on the prediction model of the WT. In this paper, the high-frequency data of the wind speed and power are all processed by the POT method to obtain the 10-min data. The specific process is as follows. First, the high-frequency data corresponding to each 10-min value are filtered by the POT method. The filtering rule sets the thresholds at the 10-min average value plus and minus a times that average, where a is taken as 0.05. In practical engineering applications, statistical significance testing indicates that a cannot be too small, and a = 0.05 is generally effective. Then, the wind-power data within the threshold range of each 10-min interval are collected, and the average of the collected power data replaces the original 10-min average value of that period, which generates the new wind-power data. Finally, the DBSCAN method is used to filter the obvious outliers to obtain a new data set. The wind-power data preprocessing process is shown in Fig. 3.
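The filtering step for a single 10-min window can be sketched as follows. The sketch assumes that the threshold band is the window average multiplied by (1 − a) and (1 + a), and that a window with no samples inside the band keeps its original average; both points, like the function name `pot_window_mean`, are assumptions rather than details given in the paper.

```python
from statistics import mean

def pot_window_mean(samples, a=0.05):
    """Re-average one 10-min window using only samples within mean * (1 +/- a)."""
    raw = mean(samples)
    low, high = sorted((raw * (1 - a), raw * (1 + a)))  # sorted() handles negative means
    kept = [s for s in samples if low <= s <= high]
    return mean(kept) if kept else raw                  # fallback: keep the raw average

# 59 readings at 1000 kW and one short dip to 400 kW within the same window.
window = [1000.0] * 59 + [400.0]
print(mean(window), pot_window_mean(window))
```

In this example the plain average is pulled down to 990 kW by the single low reading, while the filtered average returns to 1000 kW. The same function would be applied to the wind speed samples of the window to obtain the matching corrected wind speed.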

Particle Swarm Optimization BP Neural Network Algorithm
Although the PSO cannot realize adaptive learning and has slow convergence speed and poor robustness, it has strong global optimization ability. The PSO-BP neural network algorithm not only addresses the tendency of the BP neural network to fall into local optima and to train slowly, but also compensates for the shortcomings of the PSO algorithm. The topology of the BP neural network is shown in Fig. 4. The structure of the BP neural network is set as 6-6-1, and the number of training iterations, the learning rate and the error target are 100, 0.1 and 0.0001, respectively. The inputs $x_1$, $x_2$, $x_3$, $x_4$, $x_5$, $x_6$ of the BP neural network represent the hub speed, wind speed, yaw coefficient, grid power, grid current and generator current, respectively, and the output $Y$ represents the power. The interpretation of every model parameter is shown in Tab. 1.
The performance of the BP neural network prediction is strongly affected by the setting of the weights and thresholds. The PSO algorithm can be used for global optimization to find the optimal combination of the neural network weights and thresholds. The BP neural network with the optimized parameters can then be used to predict the unit power, which improves the prediction performance. The flowchart of the PSO-BP is shown in Fig. 5, and the specific steps are as follows:
(1) Create the BP neural network and initialize its weights and thresholds.
(2) Set parameters such as the transfer functions, the number of training iterations, the training error target and the learning rate for the hidden and output layers of the network.
(3) Normalize the input and output data in the training sample.
(4) Set the parameters of the PSO algorithm, and randomly generate the position and velocity of each particle.
(5) Construct the fitness function used to optimize the weights and thresholds of the neural network.
(6) Calculate the fitness value of each particle to find the individual best position and the global best position.
(7) Update the velocity and position of the particles according to the corresponding update equations.
(8) Increase the iteration count and check whether the stopping conditions are met. If they are met, stop and output the optimal weights and thresholds. Otherwise, go to Step (6).
(9) Use the optimal weights and thresholds obtained from Step (8) to train the BP neural network.

Figure 3: The process of wind-power data preprocessing (get wind-power data from the SCADA system; process the high-frequency data using the POT method to obtain 10-min data; filter apparent outliers in the 10-min data using the DBSCAN method; new wind-power data set; PSO-BP power prediction model)
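Steps (4) to (8) can be illustrated with a small standard-library sketch in which each particle encodes all weights and thresholds of a single-hidden-layer network and the fitness is the mean squared prediction error. The inertia weight `w`, the learning factors `c1` and `c2`, the swarm size, the tanh hidden units and the toy cubic training data are all assumptions made for illustration; the paper's actual settings are not recoverable from the text. Step (9), the subsequent BP training started from the best particle, is not shown.

```python
import math
import random

def forward(weights, x, n_in, n_hidden):
    """Single-hidden-layer network: tanh hidden units, linear output."""
    idx, hidden = 0, []
    for _ in range(n_hidden):
        s = weights[idx + n_in]                       # hidden-unit threshold (bias)
        s += sum(w * xi for w, xi in zip(weights[idx:idx + n_in], x))
        hidden.append(math.tanh(s))
        idx += n_in + 1
    out = weights[idx + n_hidden]                     # output threshold (bias)
    return out + sum(w * h for w, h in zip(weights[idx:idx + n_hidden], hidden))

def fitness(weights, samples, n_in, n_hidden):
    """Mean squared prediction error over the training samples."""
    return sum((forward(weights, x, n_in, n_hidden) - y) ** 2
               for x, y in samples) / len(samples)

def pso(samples, n_in, n_hidden, n_particles=20, iters=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    dim = n_hidden * (n_in + 1) + n_hidden + 1
    # Step (4): random initial positions and velocities.
    pos = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_particles)]
    # Steps (5)-(6): evaluate fitness, record individual and global bests.
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p, samples, n_in, n_hidden) for p in pos]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    history = [gbest_fit]
    for _ in range(iters):                            # Step (8): fixed iteration budget
        for i in range(n_particles):
            for d in range(dim):                      # Step (7): velocity/position update
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = fitness(pos[i], samples, n_in, n_hidden)
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f < gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
        history.append(gbest_fit)
    return gbest, history

# Toy data: a cubic relation on normalized inputs, loosely mimicking a power curve.
samples = [([v / 10.0], (v / 10.0) ** 3) for v in range(11)]
best, history = pso(samples, n_in=1, n_hidden=3, iters=50)
print(f"fitness: {history[0]:.5f} -> {history[-1]:.5f}")
```

Because the global best is only replaced by a strictly better particle, the recorded fitness never increases from one iteration to the next. For the 6-6-1 network of the paper, `n_in=6` and `n_hidden=6` would give 49 values per particle.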

Construction of Prediction Model Based on PSO-BP Algorithm
Based on the above theory, the data are preprocessed by the POT-DBSCAN algorithm, and the key parameters in the preprocessed data set are then selected as the input of the PSO-BP neural network to establish the model that predicts the power of the WT. Accuracy is the most important factor in measuring the effect of wind power prediction, and the main indexes for evaluating accuracy are the Root Mean Square Error (RMSE), the Mean Absolute Error (MAE), the Root Mean Square Percentage Error of Relative Error (RRMSE) and the entropy [31,32]. The formulas of these indexes are listed in (8), (9), (10) and (11):

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (f_i - y_i)^2} \quad (8)$$

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| f_i - y_i \right| \quad (9)$$

$$RRMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \frac{f_i - y_i}{y_i} \right)^2} \quad (10)$$

$$H_a = -\sum_{i} P_i \log P_i \quad (11)$$

where $f_i$ and $y_i$ are the predicted value and the actual value respectively, $n$ is the number of sample data and $H_a$ is the entropy.
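The four accuracy indexes can be computed with a short standard-library sketch. RMSE and MAE follow their usual definitions. The RRMSE is assumed here to be the root mean square of the per-sample relative errors, and the entropy is assumed to be the Shannon entropy (natural logarithm) of the prediction errors grouped into fixed-width bins; the bin width of 10 kW and the function names are hypothetical choices, since the text does not say how the event probabilities are obtained.

```python
import math
from collections import Counter

def rmse(pred, actual):
    return math.sqrt(sum((f - y) ** 2 for f, y in zip(pred, actual)) / len(actual))

def mae(pred, actual):
    return sum(abs(f - y) for f, y in zip(pred, actual)) / len(actual)

def rrmse(pred, actual):
    """Assumed form: RMS of per-sample relative errors (actual values must be nonzero)."""
    return math.sqrt(sum(((f - y) / y) ** 2 for f, y in zip(pred, actual)) / len(actual))

def error_entropy(pred, actual, bin_width=10.0):
    """Shannon entropy of binned prediction errors (binning is an assumption)."""
    bins = Counter(math.floor((f - y) / bin_width) for f, y in zip(pred, actual))
    n = len(actual)
    return -sum((c / n) * math.log(c / n) for c in bins.values())

predicted = [110.0, 190.0, 300.0]   # illustrative values in kW, not data from the paper
measured = [100.0, 200.0, 300.0]
print(rmse(predicted, measured), mae(predicted, measured),
      rrmse(predicted, measured), error_entropy(predicted, measured))
```

A relative-error index of this form becomes unstable when the measured power is close to zero, so in practice it would be applied only to samples above the cut-in wind speed.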

In the entropy formula, $x_i$ is a random event that may occur and $P_i$ is the probability of the occurrence of the event $x_i$.

Figure 4: The topology of the BP neural network (input layer $x_1$ to $x_6$, hidden layer, output layer)

Table 1: Interpretation of the model parameters ($x_1$: hub speed; $x_2$: wind speed; $x_3$: yaw coefficient; $x_4$: grid power; $x_5$: grid current; $x_6$: generator current; $c_1$, $c_2$: learning factors)

The flowchart of this study is displayed in Fig. 6, and the main process and steps involved are listed below.

Stage 1: Data Preprocessing
The high-frequency wind-power data are processed by the POT method to obtain the 10-min data, and then the obvious outlier data are filtered by the DBSCAN clustering method.

Stage 2: Establishment of the PSO-BP Prediction Model
The PSO-BP prediction model is established by selecting the data of the hub speed, the wind speed, the yaw coefficient, the grid power, the grid current and the generator current recorded by the SCADA system in the normal working state.

Figure 5: Flowchart of PSO-BP neural network algorithm

Stage 3: Verify the Effectiveness of the POT-DBSCAN Preprocessing Method
The preprocessed data are used for the power prediction through the PSO-BP Model, and the necessity of preprocessing is verified by the MAE, RMSE, RRMSE and the entropy error.

Model
The SCADA data used in this study come from 2 MW permanent magnet direct-drive WTs located in a mountainous wind farm in Southern China. The WT has a rotor diameter of 96 m, a cut-in wind speed of 3 m/s, a rated wind speed of 11 m/s, and a rated power of 2000 kW. The SCADA system records 10-min average values of the WT condition and external environment parameters sampled at 1 Hz, including the wind speed, the rotational speed, the current, the power, and so on. The SCADA data of WT No. 1 under normal operation within 3 to 4 months in 2017 are selected to establish the prediction model. The model has a 6-dimensional input and a 1-dimensional output, so the selected data set has a size of 7200 × 7. According to the cut-in wind speed and the rated wind speed, the data set is divided into eight parts. In each part, sample data are selected randomly with a ratio of 8:1, which gives 900 × 7 groups of sample data in total. These are then split with a ratio of 2:1 into 600 × 7 groups of training sample data and 300 × 7 groups of test sample data. Finally, the predicted power of the test samples obtained by the model is compared with the actual values, as shown in Fig. 7. The calculated RRMSE is 1.77%, the calculated entropy is 2.69, and the correlation coefficient is 0.996. The model therefore fits the data well and meets the precision requirement.
Fig. 8 shows the annual wind-power scatter diagram of WT No. 1 in 2017. First, the high-frequency data corresponding to the 10-min data collected by the SCADA system are processed by the POT method. Next, the processed data are saved as a new data set. Then, the new data set is further processed by the DBSCAN clustering method to generate the final data. Fig. 9 shows the result obtained with the POT-DBSCAN data preprocessing method.
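The sample selection described above can be sketched as a stratified random draw. The sketch assumes that the "eight parts" are eight equal-width wind speed bins between the cut-in speed (3 m/s) and the rated speed (11 m/s), which the text does not state explicitly; the function name and the record layout (wind speed as the first field) are likewise hypothetical.

```python
import random

def stratified_sample(records, cut_in=3.0, rated=11.0, n_bins=8,
                      keep_ratio=1 / 8, train_ratio=2 / 3, seed=0):
    """Split records (wind speed first) into train/test sets, bin by bin."""
    rng = random.Random(seed)
    width = (rated - cut_in) / n_bins
    bins = [[] for _ in range(n_bins)]
    for rec in records:
        v = rec[0]
        if cut_in <= v < rated:                       # outside this range: ignored
            bins[min(int((v - cut_in) / width), n_bins - 1)].append(rec)
    train, test = [], []
    for group in bins:
        # Keep one record in eight, chosen at random within the bin.
        chosen = rng.sample(group, round(len(group) * keep_ratio))
        split = round(len(chosen) * train_ratio)      # 2:1 train/test split
        train.extend(chosen[:split])
        test.extend(chosen[split:])
    return train, test

# Synthetic records: 96 per 1 m/s bin, with a placeholder power value.
records = [(3.0 + b + k / 96.0, 0.0) for b in range(8) for k in range(96)]
train, test = stratified_sample(records)
print(len(train), len(test))
```

Sampling within each bin keeps the training and test sets spread over the whole operating range, rather than dominated by the most frequent wind speeds.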

Method Validation
In this experiment, the prediction model is used to predict the power of the WTs. Since the data contain outliers, data preprocessing is required to filter the abnormal data points. Many data preprocessing methods have been proposed in recent studies. The Quartile method does not rely on the mean and variance to detect outliers, nor does it require the sequence to follow a particular distribution model. When the proportion of outliers is small, it identifies abnormal data well. In this paper, the prediction model built on Quartile-preprocessed data is therefore selected as the benchmark model. To verify the superiority and effectiveness of the proposed POT-DBSCAN method, the benchmark model is compared with the prediction model built on data preprocessed by the POT-DBSCAN. The results are shown in Tab. 2. The error results in the table indicate that the prediction effect of the POT-DBSCAN is better. The two methods are analyzed further below.
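For reference, a quartile-based filter of the kind used as the benchmark can be sketched as follows. The sketch assumes that the quartiles are computed on the power values within each wind speed interval and that points outside the conventional fences Q1 − 1.5·IQR and Q3 + 1.5·IQR are discarded; the bin width of 0.5 m/s, the factor 1.5 and the function name are assumptions, not settings reported in the paper.

```python
from statistics import quantiles

def quartile_filter(points, bin_width=0.5, k=1.5):
    """Keep (wind_speed, power) points inside the IQR fences of their speed bin."""
    bins = {}
    for v, p in points:
        bins.setdefault(int(v // bin_width), []).append((v, p))
    kept = []
    for group in bins.values():
        powers = [p for _, p in group]
        if len(powers) < 4:                # too few points to estimate quartiles
            kept.extend(group)
            continue
        q1, _, q3 = quantiles(powers, n=4)
        low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
        kept.extend((v, p) for v, p in group if low <= p <= high)
    return kept

# One wind speed bin with seven consistent readings and one obvious outlier.
data = [(5.1, p) for p in (100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0, 500.0)]
print(quartile_filter(data))
```

The filter operates on the already averaged 10-min points, so it cannot correct a 10-min value that was distorted by a wind speed ramp inside the interval.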
Under normal operation, data from WT No. 1 within the two months after August 10th, 2017 and data from WT No. 2 within the two months before October 6th, 2017 are selected for analysis. The PSO-BP power prediction model is established using the data preprocessed by the POT-DBSCAN method and by the Quartile method, respectively. A subsequent case analysis also shows that setting a lower threshold for the prediction model can more effectively predict an upcoming stoppage of the WTs (the values 50, 50, 0.03 and 3 for the MAE, RMSE, RRMSE and the entropy, respectively, are the effective lower thresholds for the WTs involved in this study). Fig. 10 shows the power prediction error results of WT No. 1 in the two months after August 10. According to the trend of the MAE curve in this figure, the MAE values after the POT-DBSCAN preprocessing are all below 50, while the MAE values after the Quartile preprocessing are higher than 50. Similarly, the RMSE and entropy after the POT-DBSCAN preprocessing are below 50 and 3, while after the Quartile preprocessing they are higher than the thresholds. After the POT-DBSCAN preprocessing, the RRMSE values are lower than 0.03, while the RRMSE values after the Quartile preprocessing are higher than 0.03. Fig. 11 shows the power prediction error results of WT No. 2 in the two months before October 6. After the POT-DBSCAN preprocessing, the MAE and RMSE values are all below 50, and the RRMSE and entropy values are below 0.03 and 3, respectively. The figure also shows that after the Quartile preprocessing, the four indicators are higher than the thresholds for some of the time. These results verify the accuracy of the threshold values.
Therefore, the WT can be considered to be operating well when the MAE, RMSE, RRMSE and the entropy are below the threshold values of 50, 50, 0.03 and 3 respectively, which is consistent with the results for the normal operation of the WT. In the indicator trend charts after the POT-DBSCAN preprocessing, the MAE, RMSE, RRMSE and the entropy are all lower than the thresholds. In the trend charts after the Quartile preprocessing, however, the indexes on some days are clearly higher than the thresholds. Because only data from the normal working state of the WT were selected, such a result is unreasonable. Therefore, according to the above numerical results, the POT-DBSCAN method proposed in this paper is superior to the Quartile method. The Quartile method and the DBSCAN method both operate on the 10-min data after averaging and do not take into account the impact of sudden wind speed changes on the high-frequency data, whereas the POT method can process the high-frequency data directly. This verifies that the proposed data preprocessing method improves the accuracy of the power prediction and is more effective.
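The threshold comparison used in this validation amounts to a simple check of each day's indexes against the four limits. The sketch below uses the thresholds stated in the text (50, 50, 0.03 and 3); the daily values, the day labels and the function name are hypothetical and serve only to show the mechanics.

```python
THRESHOLDS = {"MAE": 50.0, "RMSE": 50.0, "RRMSE": 0.03, "entropy": 3.0}

def exceeded(daily_metrics, thresholds=THRESHOLDS):
    """Return {day: [indexes above their threshold]} for days with any exceedance."""
    flagged = {}
    for day, metrics in daily_metrics.items():
        over = [name for name, limit in thresholds.items() if metrics[name] > limit]
        if over:
            flagged[day] = over
    return flagged

# Hypothetical daily index values, not results from the paper.
daily = {
    "day 1": {"MAE": 31.0, "RMSE": 38.5, "RRMSE": 0.018, "entropy": 2.4},
    "day 2": {"MAE": 62.0, "RMSE": 49.0, "RRMSE": 0.021, "entropy": 2.9},
}
print(exceeded(daily))
```

For data taken from normal operation, an empty result is the expected outcome; flagged days point either to a developing fault or, as argued above for the Quartile-preprocessed data, to inadequate preprocessing.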

Conclusion
To address the strongly nonlinear relationship in the wind-power data caused by frequent changes of wind speed, a wind-power data preprocessing method based on the POT-DBSCAN is proposed in this paper. First, the power prediction model of the WT based on the PSO-BP method is established using the data collected under normal operation. Then, the accuracy of the thresholds is verified by calculating the MAE, RMSE, RRMSE and the entropy. Finally, the prediction model analysis of the two WTs shows that the indexes of the model after POT-DBSCAN preprocessing are below the threshold values, which is consistent with the WT data being collected under normal operating conditions. In contrast, the indicator trend charts of the prediction model built on Quartile-preprocessed data show values above the thresholds on a number of days, which contradicts the normal working state of the WT. The results show that the POT-DBSCAN is more effective than the Quartile method, because the Quartile preprocessing method does not consider the impact of frequent wind speed changes on the high-frequency data, whereas the POT-DBSCAN method handles this problem better. Frequent changes of wind speed and direction not only affect the reliability of the WT, but also hinder the accurate use of the SCADA data. The POT-DBSCAN preprocessing method proposed in this paper can improve the accuracy of the data and of the prediction models. However, only the wind speed and power are considered in the current method. Wind direction and temperature also have a great influence on the power of the WT, and should be taken into consideration in further studies.