Alarms-related wind turbine fault detection based on kernel support vector machines

Wind power is playing an increasingly significant role in daily life. However, wind farms, especially offshore ones, are usually far away from cities, which makes maintenance inconvenient. The two conventional maintenance strategies, namely corrective maintenance and preventive maintenance, cannot provide condition-based maintenance that identifies potential anomalies and predicts the turbines' future operation trend. In this study, a model-based data-driven condition monitoring method is proposed for fault detection of wind turbines (WTs) using supervisory control and data acquisition (SCADA) data acquired from an operational wind farm. Due to the nature of the alarm signals, the alarm data can be used as an intermediary to link the normal data and the fault data. First, kernel principal component analysis (KPCA) is employed to select principal components (PCs) that retain the dominant information of the original dataset and reduce the computation load of further modelling. Then the selected PCs are processed for normal-abnormal condition classification to extract the abnormal condition data, which are further classified into false alarms and true alarms related to the faults. This two-stage classification approach is implemented with the kernel support vector machine (KSVM) algorithm. The results demonstrate that the two-stage fault detection method can identify the normal, alarm and fault conditions of the WTs accurately and effectively.


Introduction
With increasing electricity usage, wind power has become the world's fastest-growing renewable energy source. The installed capacity of wind turbines (WTs) has risen exponentially over the past decades. From 2001 to 2017, the worldwide installed wind power capacity increased from 23,900 to 539,581 MW, and the capacity newly installed in 2017 was 52,573 MW [1]. Because winds are richer and stronger offshore, WT installation has been moving from onshore to offshore sites. The location of WTs, especially offshore WTs, drives the operation and maintenance (O&M) cost up significantly. For an offshore WT with a 20-year lifetime, the O&M costs can be about 25-30% of the overall energy generation cost or 75-90% of the investment cost [2,3]. Besides, the harsh operating environment makes maintenance more difficult. There are two conventional maintenance strategies for WTs, namely corrective maintenance and preventive maintenance [4]. However, the O&M costs of these two conventional strategies tend to be high when either very few or a large number of failures occur. Hence, developing a condition-based and intelligent maintenance strategy for WTs is necessary to ensure reliable, safe and cost-effective operation of wind power systems.

This paper presents a model-based data-driven WT fault detection method, which distinguishes false alarms from true alarms related to faults. The model is built using the kernel support vector machine (KSVM) incorporating kernel principal component analysis (KPCA), based on historical supervisory control and data acquisition (SCADA) data. An alarm of the WT system is triggered when key component signals exceed predefined threshold limits, usually due to design defects, changes of the WT running state or component malfunction [5].
Since the alarms reveal the working conditions of the turbine's components, they can be regarded as a significant index that gives early warning of vital faults. The proposed method proceeds in three steps. Firstly, the computation load is reduced by choosing specific principal components (PCs). Secondly, the chosen PCs are used to build the normal-abnormal classification model. Finally, a classification model based on the extracted abnormal data is built to classify the alarms and faults.

Principal component analysis (PCA)
PCA transforms a set of correlated variables into a set of linearly uncorrelated variables, which are the PCs of the original dataset. It has been widely used to visualise relatedness and genetic distance between variables. The transformation can be achieved by calculating the eigenvalues of the covariance matrix or the singular values of a non-orthogonal matrix [6,7]. PCA has shown a strong capability in dimension reduction, which has been verified by research in different fields [8]. By selecting the first few PCs, most of the information is retained while the dimension of the original dataset is dramatically reduced. Hence, this technique has been widely applied in feature extraction and combined with various machine learning algorithms, such as the artificial neural network (ANN), to monitor and predict the performance of WTs [9].
To obtain the PCs from a dataset X with n-by-p dimensions, where p is the number of variables and n is the number of samples of each variable, an eigenanalysis of the covariance matrix M of the original dataset X needs to be performed. First, the dataset X needs to be standardised:

$$ z_j = \frac{x_j - \bar{x}_j}{\sigma_{x_j}}, \quad j = 1, 2, \ldots, p \tag{1} $$

where $\bar{x}_j$ is the mean value of $x_j$, $\sigma_{x_j}$ is the standard deviation of $x_j$, and $Z = [z_1, z_2, \ldots, z_p]$ denotes the standardised dataset with n-by-p dimensions. The covariance matrix M of Z is defined as

$$ M_{ij} = \operatorname{cov}(Z_i, Z_j) = E\left[(Z_i - \mu_i)(Z_j - \mu_j)\right] \tag{2} $$

where $\mu_i = E(Z_i)$ is the mean value of the ith row of Z. The PCs can be derived from the covariance matrix by using singular value decomposition (SVD). The singular values of the matrix M can be calculated by

$$ M = U S W^{T} \tag{3} $$

where S is an n-by-p rectangular matrix containing the singular values of M, U is an n-by-n matrix of left singular vectors, which are the orthonormalised eigenvectors of $MM^T$, and $W^T$ is a p-by-p matrix of right singular vectors, which are the orthonormalised eigenvectors of $M^T M$ [10]. By sorting the singular values in descending order and arranging their corresponding singular vectors in the same order, the ith PC can be obtained by the following equation:

$$ \mathrm{PC}_i = Z w_i \tag{4} $$

where $w_i$ is the ith right singular vector, i.e. the ith column of W. The singular values of M are the variances of their corresponding PCs. Hence, the magnitude of each singular value represents the weight of the information from the original dataset carried by that PC. To select the number of PCs, the accumulated variance contribution of the PCs needs to be calculated. The contribution $a_i$ of the variance $s_i$ of the ith PC is defined as

$$ a_i = \frac{s_i}{\sum_{j=1}^{p} s_j} \tag{5} $$

To retain as much information from the original dataset as possible, the number k of selected PCs should be as large as possible while still satisfying k < p. However, k must also be kept small to achieve dimension reduction. In this study, k is chosen such that the accumulated variance contribution is no smaller than 85%.
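The standardisation step and the accumulated-contribution rule can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function names and the example variances are hypothetical and not taken from the study, and the use of the sample standard deviation (n − 1 denominator) is also an assumption.

```python
from statistics import mean, stdev


def standardise(column):
    """Return the z-scores of one variable (one column of X)."""
    m = mean(column)
    s = stdev(column)  # assumption: sample standard deviation (n - 1)
    return [(x - m) / s for x in column]


def variance_contributions(variances):
    """Contribution a_i = s_i / sum_j s_j of each PC variance."""
    total = sum(variances)
    return [s / total for s in variances]


def select_num_pcs(variances, threshold=0.85):
    """Smallest k whose accumulated contribution reaches the threshold."""
    ordered = sorted(variances, reverse=True)
    accumulated = 0.0
    for k, a in enumerate(variance_contributions(ordered), start=1):
        accumulated += a
        if accumulated >= threshold:
            return k
    return len(ordered)


# Hypothetical PC variances, for illustration only (not the study's values).
example_variances = [5.0, 3.0, 1.0, 0.5, 0.5]
k = select_num_pcs(example_variances)
```

With these made-up variances the first two PCs account for 80% of the total, so a third PC is needed to pass the 85% threshold.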

Support vector machine (SVM)
SVMs are a set of supervised learning models, with associated learning algorithms, that can be applied to classification and regression analysis [11].
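Because the original problem might be stated in a finite-dimensional space in which it is not linearly separable, it needs to be mapped into a much higher-dimensional space to make the separation easier. An n-by-k training dataset Y can be considered as n points in k dimensions, implying that each point $Y_i$ (i = 1, 2, …, n) contains k PCs. The training data $Y_i$ and their predefined classes $c_i$ are given in the form below:

$$ (Y_1, c_1), (Y_2, c_2), \ldots, (Y_n, c_n), \quad c_i \in \{-1, 1\} \tag{6} $$

where $c_i$ is either −1 or 1, indicating the class of the point $Y_i$. If any alarms are triggered at time instant i, the class of $Y_i$ is assigned as $c_i = -1$; otherwise $c_i = 1$. A hyperplane needs to be found to divide the overall samples into two classes. To satisfy this condition, the hyperplane should follow:

$$ w \cdot Y_i + b \ge 1 \ \text{if } c_i = 1, \qquad w \cdot Y_i + b \le -1 \ \text{if } c_i = -1 \tag{7} $$

The inequality (7) can also be written as

$$ c_i (w \cdot Y_i + b) \ge 1, \quad i = 1, 2, \ldots, n \tag{8} $$

where w is the weight vector of the hyperplane and b is the bias. Points $Y_0$ for which $c_i (w \cdot Y_0 + b) = 1$ are named support vectors [11]. Therefore, the optimal hyperplane is described as

$$ w_0 \cdot Y + b_0 = 0 \tag{9} $$

This is the unique hyperplane that separates the training data with a maximal margin. The distance $\rho(w, b)$ between the projections of the training vectors of the two different classes is thus given by

$$ \rho(w, b) = \min_{\{Y:\, c = 1\}} \frac{Y \cdot w}{|w|} - \max_{\{Y:\, c = -1\}} \frac{Y \cdot w}{|w|} \tag{10} $$

The optimal hyperplane $(w_0, b_0)$ is given by the arguments that maximise this distance. It follows that

$$ \rho(w_0, b_0) = \frac{2}{|w_0|} = \frac{2}{\sqrt{w_0 \cdot w_0}} \tag{11} $$

Hence, $|w|$ needs to be minimised, subject to the constraints (8), to maximise the distance in (11). The weights $w_0$ of the optimal hyperplane in the feature space can be written as a linear combination of support vectors

$$ w_0 = \sum_{i=1}^{n} c_i \alpha_i^0 \phi(Y_i) \tag{12} $$

where $\phi(\cdot)$ denotes the mapping into the feature space (the identity for a linear SVM) and $\alpha_i^0$ is the Lagrangian multiplier, which is described in (14) and is non-zero only for the support vectors.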
Thus, an unknown vector Y is classified by transforming it to the feature space (Y→ϕ(Y)) and then applying the sign function:

$$ f(Y) = \operatorname{sign}\left(w_0 \cdot \phi(Y) + b_0\right) \tag{13} $$

To satisfy the constraints (8), the Lagrangian is constructed as a standard optimisation technique:

$$ L(w, b, \Lambda) = \frac{1}{2}\, w \cdot w - \sum_{i=1}^{n} \alpha_i \left[ c_i (Y_i \cdot w + b) - 1 \right] \tag{14} $$

where $\Lambda^T = (\alpha_1, \ldots, \alpha_n)$ is the vector of non-negative Lagrange multipliers corresponding to the constraints defined by (8).
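With (12), the classification function f(Y) for an unknown vector Y can be extended to

$$ f(Y) = \operatorname{sign}\left( \sum_{i=1}^{n} c_i \alpha_i^0 \, \phi(Y_i) \cdot \phi(Y) + b_0 \right) \tag{15} $$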
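The role of the support vectors in the weight vector, the margin and the sign-based decision can be illustrated with a small Python sketch for the linear case, where the feature map is the identity. The two support vectors, the multipliers and the bias below are hand-picked toy values, not the output of a trained model, and the function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weight_vector(support_vectors, classes, alphas):
    """w_0 = sum_i c_i * alpha_i * Y_i, a linear combination of support vectors."""
    dim = len(support_vectors[0])
    w = [0.0] * dim
    for y, c, a in zip(support_vectors, classes, alphas):
        for d in range(dim):
            w[d] += c * a * y[d]
    return w


def classify(y, support_vectors, classes, alphas, bias):
    """f(Y) = sign(sum_i c_i * alpha_i * (Y_i . Y) + b_0)."""
    score = sum(c * a * dot(sv, y)
                for sv, c, a in zip(support_vectors, classes, alphas)) + bias
    return 1 if score >= 0 else -1


# Toy problem: one support vector per class, values chosen by hand.
support_vectors = [(1.0, 0.0), (-1.0, 0.0)]
classes = [1, -1]
alphas = [0.5, 0.5]
bias = 0.0

w0 = weight_vector(support_vectors, classes, alphas)
margin = 2.0 / dot(w0, w0) ** 0.5  # distance between the two classes, 2 / |w_0|
```

In this toy setting both support vectors satisfy the equality case of the constraint, and any point with a positive first coordinate is assigned to class 1.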

KPCA and KSVM
Both PCA and SVM can only solve linearly separable problems. Hence, to handle a larger dataset with a linearly inseparable problem, the kernel function is introduced: the problem becomes solvable once the kernel function projects the data to a higher dimension. KPCA is an extension of PCA that uses the kernel function to perform the originally linear operations in a reproducing kernel Hilbert space. As introduced above, the calculation of PCA can be reduced to an eigenanalysis. The original data are mapped into the feature space using the radial basis function (RBF) kernel, defined as

$$ K(Z, Z^{T}) = \exp\left(-\gamma \lVert Z - Z^{T} \rVert^{2}\right) \tag{16} $$

where Z is the original input dataset and $Z^T$ is its transpose [2]. $\lVert Z - Z^T \rVert^2$ is the squared Euclidean distance between them. The parameter γ sets the width of the kernel; it cannot be predicted precisely and has to be constrained by the model or defined by the user [12].
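As a minimal sketch of the RBF kernel, the following Python snippet evaluates it for pairs of sample vectors and assembles the matrix of pairwise kernel values. The sample points and the value of gamma are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math


def rbf_kernel(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def kernel_matrix(samples, gamma):
    """n-by-n matrix of pairwise kernel values for n samples."""
    return [[rbf_kernel(u, v, gamma) for v in samples] for u in samples]


# Three hypothetical 2-D samples; gamma = 0.5 is an arbitrary choice.
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = kernel_matrix(samples, gamma=0.5)
```

The resulting matrix is symmetric with ones on its diagonal. A larger gamma makes the kernel value decay faster with distance, which is why its choice matters for the model.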
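By replacing the original dataset with the kernel, the covariance matrix of (2) can be rewritten as

$$ M^{K}_{ij} = E\left[(K_i - \mu^{K}_i)(K_j - \mu^{K}_j)\right] \tag{17} $$

where $K_i$ is the ith row of the kernel matrix K produced by the RBF kernel and $\mu^{K}_i = E(K_i)$ is its mean value. Then, following the same procedures as described by (3) and (4), the singular values and vectors, and hence the kernelised PCs, can be obtained.

Similar to KPCA, the solution of KSVM also involves a transformation of the input dataset. In this case, the selected PCs are employed, and the kernelised classification function, as derived from (15), can be written as

$$ f(Y) = \operatorname{sign}\left( \sum_{i:\, Y_i \in Y_0} c_i \alpha_i^0 \, K(Y_i, Y) + b_0 \right) \tag{18} $$

where Y is the input data to be classified and $Y_0$ denotes the set of support vectors.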
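To illustrate how a kernelised classification function can separate data that no straight line can, the sketch below applies an RBF-based decision function to an XOR-like layout of four points. The multipliers and the bias are hand-picked by symmetry rather than obtained by training, gamma is an arbitrary choice, and all names are hypothetical.

```python
import math


def rbf_kernel(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def kernel_classify(y, support_vectors, classes, alphas, bias, gamma):
    """f(Y) = sign(sum_i c_i * alpha_i * K(Y_i, Y) + b_0)."""
    score = sum(c * a * rbf_kernel(sv, y, gamma)
                for sv, c, a in zip(support_vectors, classes, alphas)) + bias
    return 1 if score >= 0 else -1


# XOR-like layout: no straight line separates the two classes.
support_vectors = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
classes = [1, 1, -1, -1]
alphas = [1.0, 1.0, 1.0, 1.0]  # hand-picked by symmetry, not trained
bias = 0.0

predictions = [
    kernel_classify(p, support_vectors, classes, alphas, bias, gamma=0.5)
    for p in support_vectors
]
```

Each of the four points is assigned to its own class, and points close to a support vector inherit that vector's class because the nearest kernel term dominates the sum.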

SCADA data
The SCADA system is a data acquisition and control system that is used for high-level supervisory management through computers, graphical user interfaces and network data communications [13].
To test and validate the proposed classification model, it is necessary to use historical data from an operational wind farm. The SCADA data used in this paper were acquired over 12 months from an operational wind farm consisting of 26 turbines. Unlike high-frequency condition monitoring data, SCADA data have a low sampling rate, usually 10 min/sample, which reduces the data storage requirement while still retaining the vital information about the operation and performance of the wind turbines [4]. The monitoring variables for each turbine consist of 128 readings covering various types of physical and electrical signals, such as temperatures, pressures, power outputs and control signals. Preprocessing of the data is essential for further analysis because the turbines are inactive during periods of low and high wind speeds. Besides, the digital and constant data need to be removed to prevent interference with the processing [14,15]. As examples, Figs. 1-3 show the wind power curves of three different turbines. For wind turbines, the S-curve refers to the relationship between the output power and the wind speed [16]. The output power is often reduced when a fault occurs, to prevent the fault from developing into a more damaging one. The dashed box indicates the fault area. As can be observed from the figures, the turbine with a generator winding fault has a shorter period of fault exposure than the turbine with a gearbox bearing fault.
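The preprocessing described above can be sketched in Python as two filters: one keeps only samples recorded while the turbine is active, the other drops digital and constant variables. The column names, the cut-in and cut-out limits, and the rule that a variable with at most two distinct values counts as digital or constant are all assumptions made for illustration, not details taken from the study.

```python
def drop_uninformative_columns(table, max_levels=2):
    """Remove constant and digital (few-level) variables from a column table."""
    return {name: values for name, values in table.items()
            if len(set(values)) > max_levels}


def keep_active_rows(table, speed_column, cut_in, cut_out):
    """Keep only the samples whose wind speed lies within [cut_in, cut_out]."""
    keep = [cut_in <= v <= cut_out for v in table[speed_column]]
    return {name: [x for x, k in zip(values, keep) if k]
            for name, values in table.items()}


# Hypothetical 10-min samples; column names and limits are illustrative only.
scada = {
    "wind_speed": [1.0, 5.0, 8.0, 12.0, 30.0],
    "active_power": [0.0, 300.0, 900.0, 1500.0, 0.0],
    "brake_flag": [1, 0, 0, 0, 1],               # digital signal
    "rated_voltage": [690, 690, 690, 690, 690],  # constant signal
}
active = keep_active_rows(scada, "wind_speed", cut_in=3.0, cut_out=25.0)
cleaned = drop_uninformative_columns(active)
```

In this toy table only the wind speed and active power columns survive, restricted to the three samples taken within the assumed operating range.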
To detect the faulty condition of the wind turbine, a two-stage classification method is proposed, as illustrated in Fig. 4. According to the time-series data, the original dataset includes data recorded under the normal working condition and alarm data. The alarm data also contain the fault data related to the alarms triggered during the fault period. In the first stage, the data are classified into normal and abnormal conditions. In the second stage, the abnormal data are further classified into true positive signals, which indicate the occurrence of a real fault, and false positive signals, which can be considered as a warning.
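The control flow of the two-stage method can be sketched as follows. The two stage functions here are simple threshold rules standing in for the trained classifiers, and the variable names and threshold values are hypothetical.

```python
def two_stage_classify(sample, is_abnormal, is_fault):
    """Label a sample as 'normal', 'alarm' (false alarm) or 'fault'."""
    if not is_abnormal(sample):      # stage 1: normal vs abnormal
        return "normal"
    if is_fault(sample):             # stage 2: true alarm vs false alarm
        return "fault"
    return "alarm"


# Stand-in rules instead of trained KSVM models (hypothetical thresholds).
def stage1(sample):
    return sample["temperature"] > 80.0


def stage2(sample):
    return sample["power"] < 100.0


samples = [
    {"temperature": 60.0, "power": 900.0},
    {"temperature": 85.0, "power": 800.0},
    {"temperature": 95.0, "power": 50.0},
]
labels = [two_stage_classify(s, stage1, stage2) for s in samples]
```

Passing the stage models in as arguments keeps the pipeline independent of how each classifier is built, so the stand-in rules could be swapped for trained models without changing the flow.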
Three normal data selection methods are used in our study. The first one is to choose the first 5000 samples in the original dataset, which is referred to as method 1. The second is to choose 2500 samples before and 2500 samples after the fault, which is referred to as method 2. The last is to choose 5000 samples randomly among the normal data, which is referred to as method 3. The fault detection method is then applied to both faulty turbines, as shown in Figs. 2 and 3. The results given in the next section are based on the turbine with a gearbox bearing fault, with the normal data selected using method 3.
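The three selection methods can be sketched in Python on a toy timeline. The function names, the fixed random seed and the reading of method 2 as taking the samples immediately adjacent to the fault period are assumptions; the sample counts are scaled down from 5000 to 4 for readability.

```python
import random


def select_first(normal_indices, n):
    """Method 1: the first n normal samples."""
    return normal_indices[:n]


def select_around_fault(normal_indices, fault_start, fault_end, n):
    """Method 2: n/2 normal samples just before and n/2 just after the fault."""
    half = n // 2  # assumes n >= 2
    before = [i for i in normal_indices if i < fault_start][-half:]
    after = [i for i in normal_indices if i > fault_end][:half]
    return before + after


def select_random(normal_indices, n, seed=0):
    """Method 3: n normal samples drawn at random without replacement."""
    rng = random.Random(seed)
    return sorted(rng.sample(normal_indices, n))


# Toy timeline of 20 samples with a fault occupying indices 8..11.
fault_start, fault_end = 8, 11
normal = [i for i in range(20) if not fault_start <= i <= fault_end]

method1 = select_first(normal, 4)
method2 = select_around_fault(normal, fault_start, fault_end, 4)
method3 = select_random(normal, 4)
```

Method 1 and method 2 draw from fixed time windows, whereas method 3 spreads the normal samples over the whole record, which may cover a wider range of operating conditions.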

Monitoring variable selection
After pre-processing the original data by removing the control and DC signals, 78 variables in total remain for further data dimension reduction. All the data samples relating to the fault are selected and processed with KPCA. To select the appropriate PCs, the variance contribution of each PC needs to be calculated, as given in Table 1. Sixteen PCs are therefore selected to meet the requirement of an accumulated variance contribution of at least 85%.

Normal-abnormal condition classification
The selected PCs are further processed by KSVM. Since KSVM is a supervised learning algorithm, the dataset needs to be divided into two groups: the data under normal conditions and the data under abnormal conditions (formed by false alarms and true alarms related to the fault). Since it is impossible to plot a 16-dimensional graph from the selected 16 PCs, all the results are plotted in the 2D space of wind speed and active power. Fig. 5 gives an example of the data to be processed for normal-abnormal classification, where the blue dots represent the normal data and the red crosses represent the abnormal data.
As mentioned above, when the data are processed with the KSVM algorithm, data that are linearly inseparable in a lower dimension are projected into a higher dimension, where they can be separated by a hyperplane. As an example, Fig. 6 shows the working principle of the KSVM, where the blue dots represent the normal data and the red dots represent the abnormal data. The support vectors are labelled by green circles, while the fitted hyperplane is expressed in terms of x and y, where x and y are the wind speed and active power, respectively. The coefficient of determination $r^2$ is used to evaluate the accuracy of the fitting; its value for this fitted plane is 0.8605. During this process, 70% of the data were used as the training set and 30% of the data were used for validation. The validation result is displayed in Figs. 7 and 8. In Fig. 7, the normal data classified as normal are shown as blue dots, while the normal data classified as alarm are shown as blue crosses; the alarm data classified as alarm are shown as red dots and the alarm data classified as normal are shown as red crosses. Fig. 8 shows the confusion matrix of the normal-alarm classification result, which is used to evaluate the performance of the algorithm. The white areas show the rates at which normal and alarm data were predicted correctly, and the yellow areas show the misclassified data. As can be observed from the figure, 99.9% of the normal data and 90.9% of the alarm data are classified correctly, leading to a total accuracy of 99.4%.
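The 70/30 split and the rates read from the confusion matrix can be sketched in Python as follows. Shuffling before the split and the fixed seed are assumptions, since the text does not state how the split was drawn, and the toy labels are made up and unrelated to the reported results.

```python
import random


def split_70_30(samples, seed=0):
    """Shuffle and split samples into 70% training and 30% validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def confusion_rates(true_labels, predicted_labels, classes=("normal", "alarm")):
    """Per-class correct-classification rate and overall accuracy."""
    rates = {}
    for c in classes:
        predicted_for_c = [p for t, p in zip(true_labels, predicted_labels) if t == c]
        rates[c] = sum(p == c for p in predicted_for_c) / len(predicted_for_c)
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return rates, correct / len(true_labels)


# Toy labels for illustration; these are not the study's results.
truth = ["normal"] * 8 + ["alarm"] * 2
predicted = ["normal"] * 7 + ["alarm"] + ["alarm", "normal"]
rates, accuracy = confusion_rates(truth, predicted)
train, validation = split_70_30(range(10))
```

The overall accuracy is a sample-weighted average of the per-class rates, so it is dominated by the larger class. This is why a total accuracy can sit close to the normal-data rate even when the alarm-data rate is noticeably lower.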

Alarm-fault classification
After the normal-alarm classification, the alarm-fault classification is carried out. Fig. 9 shows the alarm-fault classification in terms of wind speed and active power. The blue dots represent alarm signals and the red dots represent fault signals. The support
To examine the robustness of the proposed method, more turbines are tested with different SCADA data selection methods. It can be observed from Table 2 that the performance for the turbine with the generator winding fault is not as good as that for the turbine with the gearbox bearing fault. This might be due to the insufficient samples

Conclusion
With these alarm signals identified, a warning of the fault can be given at an early stage, which leaves sufficient time for maintenance scheduling. According to the results, several conclusions are drawn as follows:
• To select the PCs of the monitoring variables, the accumulated variance of the PCs can be regarded as the most significant factor. However, retaining most of the information of the original dataset has to be traded off against the computation load.
• Compared with other machine learning algorithms, the SVM has its strength in solving the two-group classification problem. Compared with the decision tree and discriminant analysis algorithms, the SVM demonstrates more accurate results.
• In terms of sample data selection, the turbine with a large amount of abnormal data shows a better classification performance, indicating the influence of the sample selection.
• The KPCA can reduce the dimension to an acceptable range, while the KSVM demonstrates excellent results for the two-stage classification.
Further work will focus on examining the proposed approach in combination with deep learning algorithms and on verifying the results with more data from both simulations and a physical test rig.