Attention Mechanism-Based CNN-LSTM Model for Wind Turbine Fault Prediction Using SSN Ontology Annotation

The traditional model for wind turbine fault prediction is not sensitive to the time sequence data and cannot mine the deep connection between the time series data, resulting in poor generalization ability of the model. To solve this problem, this paper proposes an attention mechanism-based CNN-LSTM model. The semantic sensor data annotated by SSN ontology is used as input data. Firstly, CNN extracts features to get high-level feature representation from input data. Then, the latent time sequence connection of features in different time periods is learned by LSTM. Finally, the output of LSTM is input into the attention mechanism module to obtain more fault-related target information, which improves the efficiency, accuracy, and generalization ability of the model. In addition, in the data preprocessing stage, the random forest algorithm analyzes the feature correlation degree of the data to get the features of high correlation degree with the wind turbine fault, which further improves the efficiency, accuracy, and generalization ability of the model. The model is validated on the icing fault dataset of No. 21 wind turbine and the yaw dataset of No. 4 wind turbine. The experimental results show that the proposed model has better efficiency, accuracy, and generalization ability than RNN, LSTM, and XGBoost.


Introduction
In recent years, the exploitation and utilization of petroleum and other fossil fuels have promoted development in many fields. However, overexploitation is gradually depleting these nonrenewable resources, which makes the energy crisis and climate change urgent problems [1]. Countries around the world have begun to turn their attention to renewable energy. Wind energy, a renewable source that is clean and easy to use, has attracted worldwide attention. Although wind energy is developing rapidly, how to accurately predict wind turbine faults and effectively reduce the maintenance cost of wind farms remain urgent problems to be solved.
Sensor technology has been applied widely in various fields. A large number of sensors are also installed on the wind turbine equipment to collect the operation data of the wind turbine. Based on the collected mass operation data, wind turbine state data analysis method is used to predict the fault of wind turbine. This method analyzes the state data of the wind turbine from multiple layers and excavates the potential valuable fault information of data by intelligent algorithm, realizing the fault prediction of the wind turbine. In recent years, there have been some related research results. There are mainly fault prediction models based on traditional machine learning methods. Kusiak and Verma [2] used data mining technology to analyze the bearing overtemperature fault and established a bearing fault prediction model to predict the fault. Although the overtemperature fault can be predicted accurately, the false alarm rate is relatively high, which increases the burden of maintenance personnel. Kusiak and Li [3] layer upon layer processed SCADA data of wind turbine. Based on the processed data, the wind fault prediction model was constructed and predicted successfully for wind turbine fault, but the accuracy of the model needs to be improved. Zhong et al. [4] proposed a rapid data-driven fault diagnostic method, which integrates data preprocessing and machine learning techniques. In this method, fault features are extracted by using the modified Hilbert-Huang transforms and correlation techniques. Pairwise-coupled sparse Bayesian extreme learning machine is applied to build a real-time multifault diagnostic system. The experimental results show that the method can diagnose quickly the fault, but the generalization ability of the model is poor. Chen and Zhang [5] proposed a fault prediction method of wind turbine blade cracking based on RF-LightGBM algorithm. 
In this method, the random forest algorithm first ranks the importance of features to select the important ones. Then, the classification model is trained on the data after feature selection and optimized by K-fold cross validation. Finally, the method successfully predicted the wind turbine blade fault. Wang et al. [6] put forward an XGBoost algorithm-based fault prediction model of the wind turbine main bearing. Wu et al. [7] presented a fault diagnosis method of wind turbines based on the ReliefF and XGBoost algorithms. Hsu et al. [8] proposed a novel fault diagnosis technique for wind turbine gearboxes. While statistical process control was applied to fault diagnosis, random forest and decision tree algorithms were employed to construct the predictive models for wind turbine anomalies. Fan and Tang [9] proposed a wind turbine pitch anomaly recognition system based on AdaBoost-SAMME. The models proposed in [5–9] have high accuracy, but their generalization ability needs to be improved. Wang et al. [10] proposed a fault diagnosis method of wind turbine gearbox based on Riemannian manifold, which is fast in model training, but the accuracy of the model is not high. Zhang et al. [11] proposed a wind turbine fault diagnosis method based on the Gaussian process metamodel, which performs well but depends heavily on the dataset, which leads to poor generalization ability.
With the development of deep learning, there are many models for wind turbine fault prediction based on deep learning. Chen et al. [12] proposed a fault diagnosis method of wind turbine gearbox based on wavelet neural network. This method can predict faults accurately. But it cannot learn deep features in data. Lu et al. [13] proposed a wind turbine planetary gearbox fault diagnosis method based on self-powered wireless sensor and deep learning. This method has high accuracy, but it cannot learn the temporal relationship between data. Kavaz and Barutcu [14] proposed a wind turbine sensor fault detection method based on artificial neural network. This method combines with SCADA system for data acquisition, which makes the model have the advantages of low cost, but the model lacks the ability to process time series data. Chen et al. [15] proposed a model of the relationship between root cause and symptom of wind turbine fault based on Bayesian network to predict wind turbine fault. However, the complexity of the model increases exponentially with the increase of parent nodes of Venn diagram, which makes the training time too long. Shi et al. [16] proposed an intelligent fault diagnosis method of wind turbine bearing based on convolution neural network. This method can extract features effectively, but the accuracy of the method needs to be improved. Zhang et al. [17] proposed a fault diagnosis method of wind turbine gearbox based on 1DCNN-PSO-SVM. This method has a good performance in feature extraction, but it lacks the ability to learn time series features. In order to solve the problems of low efficiency and poor accuracy of manual detection method for wind turbine blade defect, Zhang and Wen [18] proposed an improved Mask R-CNN detection method for wind turbine blade defect. Firstly, ResNet-50 combined with FPN is used to generate feature map. Then, it is input into RPN to screen out the ROI. 
Finally, the size of the feature map is determined by ROIAlign, and the result is input into the fully connected network for blade defect detection. The experimental results show that this method has fast detection and high accuracy. However, it only considers the characteristics of a single image and does not mine the temporal relationship between data. Chang et al. [19] proposed a concurrent convolution neural network for fault diagnosis. The method has high accuracy and strong generalization ability. However, its network structure parameters and computational speed need to be further optimized. Jiang et al. [20] proposed a fault diagnosis model of wind turbine gearbox based on a multiscale convolution neural network (MSCNN). The proposed MSCNN incorporates multiscale learning into the traditional CNN architecture, which greatly improves the feature learning ability and diagnosis performance of the model. But the model cannot learn the long-term correlation between the data. Yin et al. [21] proposed a temperature fault early warning method for wind turbine main bearing based on Bi-RNN, but it suffers from vanishing and exploding gradients when processing time series data. Yin et al. [22] proposed a fault diagnosis method of wind turbine gearbox based on the optimized LSTM neural network with cosine loss. However, this method cannot learn features directly from the original vibration signal. Zheng et al. [23] put forward a fault prediction method for wind turbine gearbox based on K-means clustering and LSTM. Although it can effectively process time series data, it inputs the original features of the wind turbine directly into the model, which makes model training too long. On this basis, a fault prediction method for wind turbines based on the deep belief network was proposed, which extracts features of wind turbine data well but has a large error [24]. Lei et al. [25] presented an end-to-end LSTM model for fault prediction. The model uses a data fusion strategy based on the IDENTITY function to extract multisensor features. LSTM captures long-term dependencies through recurrent behaviour and a gating mechanism. The experimental results show that this method can effectively classify faults from raw time series signals collected by single or multiple sensors. However, the strategy used in the data fusion stage of this method needs to be improved.
Given the shortcomings of traditional machine learning in processing time series data and the advantages of CNN [26] in feature extraction and LSTM [27] in processing time series data, an attention mechanism-based CNN-LSTM model for wind turbine fault prediction is constructed. The contributions of this method are summarized as follows:
(i) This method proposes a joint training mode of CNN and LSTM, which can effectively process the time series data and ensure the training speed. In addition, since irrelevant features degrade model performance, the attention mechanism [28] is used to redistribute feature weights to ensure the performance of the model and improve its generalization ability
(ii) In order to further eliminate the influence of irrelevant features and improve the construction speed of the model, the random forest algorithm [29] is used to select features in the data preprocessing stage, which further improves the generalization ability of the model
The rest of this paper is organized as follows. In Section 2, the data acquisition process of wind turbine is introduced. Section 3 introduces the model of this paper. Section 4 presents the experimental procedure and experimental results. Section 5 is the conclusion, including a summary of the method and future work.

Data Acquisition
Because the data transmitted by the sensors is heterogeneous in form and description information, this paper uses the SSN ontology annotation method to format and unify the sensor data of the wind turbine. The specific process of the ontology annotation method is shown in Figure 1.
First, the method analyzes the sensor data of the wind turbine to find its concepts and properties. Second, it analyzes the structure of the SSN ontology and compares the sensor data with that structure to extract useful information. On this basis, the annotation information of the sensor data is obtained by the semiautomatic annotation mapping method. Then, the annotation information is written to a mapping file in XML format using R2RML. The mapping file mainly stores the annotation information of the wind turbine sensor data. Next, the semantic conversion algorithm of sensor data transforms the mapping file into ontology instances to generate RDF semantic sensor data. Finally, the RDF semantic sensor data is stored in HBase by Spark Streaming.
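To make the end product of the annotation pipeline more concrete, the following minimal sketch turns one sensor reading into RDF triples in N-Triples syntax. It is an illustration under stated assumptions, not the authors' conversion algorithm: it skips the R2RML mapping file, uses terms from the W3C SOSA core of SSN (`sosa:Observation`, `sosa:madeBySensor`, `sosa:observedProperty`, `sosa:hasSimpleResult`, `sosa:resultTime`), and the `http://example.org/windfarm/` namespace and the function name `annotate_reading` are hypothetical.

```python
from datetime import datetime, timezone

SOSA = "http://www.w3.org/ns/sosa/"
XSD = "http://www.w3.org/2001/XMLSchema#"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
# Hypothetical base namespace for wind farm instances (not from the paper).
BASE = "http://example.org/windfarm/"


def annotate_reading(turbine_id, feature, value, timestamp):
    """Return N-Triples lines describing one sensor reading as a sosa:Observation."""
    stamp = timestamp.strftime("%Y%m%dT%H%M%SZ")
    obs = f"<{BASE}observation/{turbine_id}/{feature}/{stamp}>"
    sensor = f"<{BASE}sensor/{turbine_id}/{feature}>"
    prop = f"<{BASE}property/{feature}>"
    return [
        f"{obs} {RDF_TYPE} <{SOSA}Observation> .",
        f"{obs} <{SOSA}madeBySensor> {sensor} .",
        f"{obs} <{SOSA}observedProperty> {prop} .",
        f'{obs} <{SOSA}hasSimpleResult> "{value}"^^<{XSD}double> .',
        f'{obs} <{SOSA}resultTime> "{timestamp.isoformat()}"^^<{XSD}dateTime> .',
    ]


if __name__ == "__main__":
    when = datetime(2020, 1, 1, tzinfo=timezone.utc)
    for line in annotate_reading(15, "yaw_speed", 3.5, when):
        print(line)
```

Each reading becomes five triples that share one observation subject, so readings from different sensors follow the same structure regardless of their original format.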
Based on the SSN ontology semantic annotation method, a SCADA system is used to collect data. The icing fault dataset and yaw fault dataset of a wind farm in Yunnan Province are collected. According to the SSN ontology structure diagram shown in Figure 2, the corresponding wind turbine features are generated. Among them, the icing fault data involves 26 features, as shown in Table 1; the yaw fault data involves 28 features, as shown in Table 2.

Introduction to the Model
Model Structure. The structure of the attention mechanism-based CNN-LSTM wind turbine fault prediction model is shown in Figure 3. The model can be divided into three parts. In the first part, the processed data are input into CNN and the high-level feature representation is obtained by feature extraction of CNN. In the second part, the output of CNN is input into LSTM; the deeper temporal relationship between features is obtained by LSTM. In the third part, the attention mechanism is introduced to acquire more fault-related information. This model is also called the CLA model.
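As a rough illustration of the first part of the model, the sketch below applies a one-dimensional convolution with a sigmoid activation to a feature vector and then max pooling. It is a simplified, single-kernel sketch in plain Python rather than the Keras layers used in the experiments; the function names `conv1d_features` and `max_pool` and the pooling window size are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def conv1d_features(x, kernel, bias):
    """Slide a kernel of size g over q features; returns q - g + 1 activations."""
    g = len(kernel)
    return [
        sigmoid(sum(w * v for w, v in zip(kernel, x[i:i + g])) + bias)
        for i in range(len(x) - g + 1)
    ]


def max_pool(features, pool_size):
    """Non-overlapping max pooling; a trailing partial window is kept."""
    return [
        max(features[i:i + pool_size])
        for i in range(0, len(features), pool_size)
    ]


if __name__ == "__main__":
    sample = [0.2, 0.8, 0.5, 0.1, 0.9]          # hypothetical normalized features
    local = conv1d_features(sample, [1.0, -1.0], 0.0)
    print(max_pool(local, 2))
```

Pooling keeps only the strongest response in each window, which is how the key information is retained while the number of values passed on to the next layer shrinks.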
For a given training dataset $D = \{(x_i, y_i) \mid i = 1, \cdots, p\}$, $x_i$ represents the input sample data, $x_i \in \mathbb{R}^q$, where $q$ is the number of features, and $y_i \in \{0, 1\}$ is the label of the $i$-th sample data. The sample data is input into CNN, and the high-level feature representation is obtained by extracting the original features:

$$j_i = \sigma(\omega \cdot x_{i:i+g-1} + b),$$

where $\omega$ is the convolution kernel, $g$ is the size of the convolution kernel, $x_{i:i+g-1}$ is the $i$-th feature to the $(i+g-1)$-th feature, $\sigma$ is the activation function, and $b$ is the offset term. After the calculation of the convolution layer, the characteristic matrix $J$ is acquired:

$$J = [j_1, j_2, \cdots, j_{q-g+1}].$$

Then, the maximum pooling technique [30] is used to process the local characteristic matrix $J$ of the fault to retain the key information of the features and reduce the parameters; finally, the local optimal solution is as follows:

$$M = \max(J).$$

Then, the $M$ vector is connected to form the $H$ vector by the full connection layer:

$$H = \mathrm{FC}(M).$$

The output $H$ of the convolution network is taken as the input of LSTM. After receiving the output of CNN, the forgetting gate, memory gate, and output gate are calculated from the hidden state $h_{t-1}$ of the previous time step and the current input $x_t$ to form a memory unit, which runs through all processes. It can retain important information and remove unimportant information; the specific process is as follows: (1) Firstly, the data passes through the forgetting gate.
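The sigmoid unit in the forgetting gate generates a vector with entries between 0 and 1 by calculating $h_{t-1}$ and $x_t$. According to this vector, LSTM can determine what information needs to be retained and what information needs to be discarded in memory unit $C_{t-1}$:

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f),$$

where $f_t$ is the forgetting gate, $\sigma$ is the sigmoid activation function, $W_f$ and $b_f$ are the weight matrix and offset vector of the forgetting gate, and $h_{t-1}$ and $x_t$ are the inputs.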
(2) The next step is to determine which part of the new information is added to the memory unit, which involves two operations. Firstly, the input gate is obtained by calculating $h_{t-1}$ and $x_t$; the input gate determines which part of the information needs to be updated. Then, $h_{t-1}$ and $x_t$ pass through a tanh layer to obtain the candidate memory unit:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i),$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C),$$

where $i_t$ is the input gate, $\tanh$ is the activation function, $\tilde{C}_t$ is the candidate memory unit, and $W_i$, $W_C$, $b_i$, and $b_C$ are the corresponding weight matrices and offset vectors.
(3) LSTM updates the memory unit: memory unit $C_{t-1}$ is updated to memory unit $C_t$. The update proceeds as follows: firstly, part of the information of the old memory unit is forgotten through the forgetting gate. Secondly, part of the information of the candidate memory unit $\tilde{C}_t$ is added through the input gate. Finally, the new memory unit $C_t$ is obtained:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t,$$

where $C_t$ is the new memory unit and $\odot$ denotes element-wise multiplication.
(4) Finally, the LSTM determines which state characteristics need to be output through the output gate. The inputs $h_{t-1}$ and $x_t$ pass through the sigmoid layer of the output gate to get a judgment condition, and the memory unit passes through the tanh layer to get a vector with entries between -1 and 1, which is multiplied by the judgment condition of the output gate to get the final output of the memory unit:

$$O_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o),$$
$$h_t = O_t \odot \tanh(C_t),$$

where $O_t$ is the output gate, $W_o$ and $b_o$ are its weight matrix and offset vector, and $h_t$ is the final output.
(5) The data is processed by LSTM to get the output $H_t = [h_1, \cdots, h_t]$ of each time step; $H_t$ is then input into the attention module. Through the attention mechanism, different weights can be assigned to the fault characteristics of the wind turbine. The formula is as follows:

$$w_t = \tanh(h_t),$$

where $h_t$ is the input and $w_t$ is the target weight.
Then, the softmax function is used to normalize the attention weights into probabilities:

$$P_k = \frac{\exp(w_k)}{\sum_{j=1}^{t} \exp(w_j)}, \quad k = 1, \cdots, t,$$

where $P_t$ is the weight probability vector. The generated attention weight is assigned to the corresponding hidden layer state code $h_t$:

$$a_t = \sum_{k=1}^{t} P_k h_k,$$

where $a_t$ is the weighted average of $h_t$ and $P_t$ is the weight.
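The gate computations in steps (1) to (4) can be sketched as a single LSTM update. This is a minimal plain-Python sketch of the standard LSTM cell, not the Keras implementation used in the experiments; the helper names `affine` and `lstm_step` and the layout of `params` (gate name mapped to a weight matrix and offset vector acting on the concatenation of $h_{t-1}$ and $x_t$) are assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W·v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update. params maps 'f', 'i', 'c', 'o' to a (W, b) pair."""
    z = h_prev + x_t  # concatenation [h_{t-1}, x_t]

    def gate(name, activation):
        W, b = params[name]
        return [activation(a) for a in affine(W, z, b)]

    f_t = gate("f", sigmoid)        # forgetting gate
    i_t = gate("i", sigmoid)        # input gate
    c_hat = gate("c", math.tanh)    # candidate memory unit
    o_t = gate("o", sigmoid)        # output gate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


if __name__ == "__main__":
    zero = {name: ([[0.0, 0.0]], [0.0]) for name in ("f", "i", "c", "o")}
    print(lstm_step([1.0], [0.0], [1.0], zero))
```

With all weights set to zero, every gate evaluates to 0.5 and the candidate to 0, so the memory unit is simply halved; this makes the role of the forgetting gate easy to see in isolation.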

Wireless Communications and Mobile Computing
Finally, the results are output through the full connection layer:

$$y_t = \sigma(W_{final} \cdot a_t),$$

where $y_t$ is the prediction result, $W_{final}$ is the weight matrix, and $\sigma$ is the sigmoid activation function.
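The attention weighting and the final sigmoid output can be sketched together as follows. This is a hedged illustration: to obtain one scalar score per time step, the sketch introduces a scoring vector `v` that is an assumption of this example and does not appear in the paper, and the names `attention_pool` and `predict` are likewise hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, v):
    """Score each h_t as v·tanh(h_t), normalize with softmax,
    and return (weights, weighted sum of the hidden states)."""
    scores = [
        sum(vi * math.tanh(hi) for vi, hi in zip(v, h))
        for h in hidden_states
    ]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(p * h[d] for p, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context


def predict(context, w_final):
    """Sigmoid of the weighted context, giving a fault probability."""
    z = sum(w * a for w, a in zip(w_final, context))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    H = [[0.1, 0.0], [0.9, 0.4], [0.2, 0.1]]   # hypothetical LSTM outputs
    weights, context = attention_pool(H, [1.0, 1.0])
    print(weights, predict(context, [0.5, -0.5]))
```

Because the weights sum to 1, the context vector stays on the same scale as the hidden states, and time steps with larger scores contribute more to the prediction.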

Model Training and Testing. Algorithm 1 shows the pseudocode of the CLA model algorithm. First, all the trainable parameters in the model are initialized. Second, the training dataset $D = \{(x_i, y_i) \mid i = 1, \cdots, p\}$ is fed into CNN through the input layer of the model. CNN uses convolution and pooling layers to get the deep feature matrix $X = [X_1, \cdots, X_i]$ of the data. Then, the output of CNN is used as the input of LSTM. LSTM further learns the features of wind turbine data at multiple consecutive times to obtain the time series dependence between features. The output of each LSTM time step, recorded as $H_t = [h_1, \cdots, h_t]$, is input to the attention mechanism module to redistribute the feature weights. Finally, according to the weighted features $a_t$, the full connection layer generates the prediction results $y$. According to the error between the predicted results and the real values, the model updates the trainable parameters. The model then checks whether the number of training iterations has reached the maximum. If not, the number of iterations is increased by 1 and the training process is repeated. Otherwise, the trained model is obtained. The training and testing flow of the CLA model is shown in Figure 4. First, the processed data are divided into a training dataset and a test dataset. Second, the training dataset is input into the constructed model and the features of the data are extracted by the convolution neural network. Then, the results are input into LSTM for time series feature transformation, and the fault characteristics of each time segment are integrated into a unified sequential fault feature.

Algorithm 1: Pseudocode of the CLA model algorithm.
Input: Status data of the wind turbine after processing: $D = \{(x_i, y_i) \mid i = 1, \cdots, p\}$
Output: Fault prediction results of the wind turbine.
• FC is the full connection layer;
• max is the maximum pooling layer;
• $\sigma$ is the sigmoid activation function;
• $\omega$ is the convolution kernel, $g$ is the size of the convolution kernel, $x_{i:i+g-1}$ are the features from $i$ to $i+g-1$, and $b$ is the offset term;
• $W_o$, $W_f$ are the weight matrices, $b_o$ is the offset vector, $h_{t-1}$ is the hidden state of the previous time, $x_t$ is the input of the current time, and $C_t$ is the memory unit of the current time;
• $y$ is the prediction result of the model, $y'^{(i)}$ is the prediction for the $i$-th sample, and $y^{(i)}$ is the real label of that sample.
Training:
Initialization: Initialize all parameters in the model;
For p in n do:
(1) Deep-level features are extracted based on CNN: $X_i = \mathrm{FC}\{\max\{\sigma(\omega \times x_{i:i+g-1} + b)\}\}$;
(2) LSTM integrates the fault characteristics of each time segment into a unified sequential fault feature: $h_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) * \tanh(C_t)$;
(3) The attention mechanism allocates the weights and the full connection layer generates the prediction results: $a_t = \sum \sigma(\tanh(H_t)) * h_t$, $y = \mathrm{FC}(W_f \cdot a_t)$;
(4) Calculate the loss value of the model $L = -\sum_{i=1}^{N} \left[ y^{(i)} \log y'^{(i)} + (1 - y^{(i)}) \log(1 - y'^{(i)}) \right]$, then adjust the model parameters.
End.
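The loss in step (4) of Algorithm 1 is the binary cross-entropy. A minimal sketch of the summed form is given below; the function name `binary_cross_entropy` and the clipping constant `eps` are assumptions of this example, added so that the logarithm stays finite when a prediction is exactly 0 or 1.

```python
import math


def binary_cross_entropy(labels, predictions, eps=1e-12):
    """Summed binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


if __name__ == "__main__":
    # A confident correct prediction costs less than a confident wrong one.
    print(binary_cross_entropy([1], [0.9]), binary_cross_entropy([1], [0.1]))
```

The leading minus sign makes the loss non-negative, so a smaller value means the predicted probabilities agree better with the labels.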

Experiment and Analysis
The construction of the CLA model is based on Keras deep learning framework. In the experiment of wind turbine icing fault prediction, a two-layer convolution neural network and a two-layer LSTM neural network are used, where the number of neurons in each layer of convolution neural network and LSTM neural network is set to 64. In the experiment of wind turbine yaw fault prediction, a two-layer convolution neural network and a two-layer LSTM neural network are also used, where the number of neurons in each layer of convolution neural network is set to 32 and the number of neurons in each layer of LSTM neural network is set to 16. In order to prevent overfitting, dropout is introduced into the model and the value of dropout is set to 0.5.

Selection of Evaluation Indexes. In the CLA model, sigmoid is used as the activation function and binary_crossentropy is used as the loss function. We select the evaluation indicators commonly used in the field of fault prediction: accuracy (A), precision (P), recall (R), and F1 value (F1). Accuracy is the ratio of the number of samples classified correctly by the model to the total number of samples in a given test dataset:

$$A = \frac{TP + TN}{TP + TN + FP + FN}. \tag{14}$$

Precision refers to how many of the samples predicted as positive by the model are real positive samples:

$$P = \frac{TP}{TP + FP}. \tag{15}$$

Recall indicates how many positive samples in the dataset are predicted correctly:

$$R = \frac{TP}{TP + FN}. \tag{16}$$

The F1 value combines precision and recall. Because P and R tend to trade off against each other, it is difficult to improve both at once in the experimental process, so they need to be evaluated together. The F1 value reconciles them as the harmonic mean of P and R; for a given sum of P and R, the closer the two are, the greater F1 is:

$$F1 = \frac{2 \times P \times R}{P + R}. \tag{17}$$

In Equations (14)-(17), TP is the number of wind turbine samples in normal state that are predicted as normal state, FP is the number of samples in fault state that are predicted as normal state, FN is the number of samples in normal state that are predicted as fault state, and TN is the number of samples in fault state that are predicted as fault state.

Figure 7: Importance of icing fault characteristic.
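The four evaluation indexes can be computed directly from the confusion counts TP, FP, FN, and TN. The following minimal sketch does so; the function name `evaluation_indexes` and the convention of returning 0.0 when a denominator is zero are assumptions of this example.

```python
def evaluation_indexes(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"A": accuracy, "P": precision, "R": recall, "F1": f1}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(evaluation_indexes(tp=80, fp=10, fn=20, tn=90))
```

Since F1 is a harmonic mean, it is pulled toward the smaller of P and R, which is why a model cannot reach a high F1 by maximizing only one of them.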
In order to verify the effectiveness of the proposed wind turbine fault prediction model in dealing with time series data and generalization problems, two groups of experiments are carried out, using RNN [21], LSTM [25], and XGBoost [6] algorithm as a comparison.
In the first group of experiments, the data of the No. 15 and No. 3 wind turbines are each divided into a training dataset and a corresponding test dataset. The two training datasets are used directly to train the CLA model and the reference models; the corresponding test datasets are then used to verify the models.
In the second group of experiments, the two CLA models and the reference models trained in the first group are tested on the data of the No. 21 and No. 4 wind turbines, and the experimental results are compared and analyzed.

Experimental Results and Analysis. Figures 5 and 6 show Logloss and ClassError of the CLA model on the dataset of Experiment 1, respectively. Both decrease continuously during model iteration, which indicates that the model converges as the number of iterations increases.
The results of the CLA model and the comparison models on the two datasets of Experiment 1 are shown in Table 3, and the results of Experiment 2 are shown in Table 4. On the icing fault dataset of No. 21 wind turbine, the accuracy of the CLA model is 74.92%, which is 2.50%, 6.99%, and 4.54% higher than that of the LSTM, RNN, and XGBoost models, respectively, and the CLA model is also the best on the other three indicators P, R, and F1. On the yaw fault dataset of No. 4 wind turbine, the accuracy of the CLA model is 77.09%, which is 2.54%, 4.61%, and 6.71% higher than that of the LSTM, RNN, and XGBoost models, respectively, and the CLA model is again the best on P, R, and F1.
According to the results of Experiment 1 and Experiment 2, the CLA model performs best in A, P, R, and F1 compared with the LSTM and RNN models. The results show that fusing CNN and LSTM and introducing the attention mechanism allows the model to learn the temporal relationship of fault features, which improves its accuracy and generalization ability. This proves the effectiveness of the CLA model. However, on the test datasets of No. 15 and No. 3 wind turbines, the accuracy of the CLA model is 1.59% and 0.34% lower, respectively, than that of the XGBoost model. The CLA model needs more data for deep learning than XGBoost; with the limited data available, it overfits, which makes it perform worse than XGBoost on these two test datasets. Nevertheless, the CLA model is better able to mine the deep relationships in time series data, so its generalization ability is better than that of XGBoost.

Optimization of CLA Model
In order to further improve the generalization ability of the CLA model, the original features of the icing fault dataset of No. 15 and No. 21 wind turbine are input into the random forest algorithm; the original features are screened through many experiments. Finally, the histogram of feature importance is obtained as shown in Figure 7. Based on this histogram, 12 features of the highest correlation degree with the icing fault of the wind turbines are selected, as shown in Table 5.
In the same way, random forest algorithm is used in the yaw fault dataset of No. 3 and No. 4 wind turbine; the feature importance histogram is obtained as shown in Figure 8. Based on this histogram, 10 features of the highest correlation degree with the yaw fault of the wind turbines are selected, as shown in Table 6.
The two feature datasets of No. 15 and No. 3 wind turbines screened by the random forest algorithm are input into the CLA model for training; the trained model is then tested on the screened feature datasets of No. 21 and No. 4 wind turbines to verify the screening effect. The experimental results are shown in Table 7. For the screened icing fault dataset of No. 21 wind turbine, the CLA model is still the best and its accuracy is improved by 2.84%. For the screened yaw fault dataset of No. 4 wind turbine, the CLA model is still the best and its accuracy is improved by 9.13%. This shows that feature screening by the random forest algorithm can effectively improve the generalization ability of the model.
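The screening step amounts to ranking features by importance and keeping the top ones. The sketch below shows only that selection step, assuming the importance scores have already been produced by a random forest; the scores in the usage example are made up for illustration, and the function names `select_top_features` and `project` are hypothetical.

```python
def select_top_features(importances, k):
    """Return the k feature names with the highest importance, best first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def project(sample, selected):
    """Keep only the selected features of one sample, in ranked order."""
    return [sample[name] for name in selected]


if __name__ == "__main__":
    # Hypothetical importance scores; real values would come from a random forest.
    scores = {"yaw_speed": 0.05, "acc_x": 0.30, "environment_tmp": 0.45, "int_tmp": 0.20}
    selected = select_top_features(scores, 2)
    sample = {"yaw_speed": 1.2, "acc_x": 0.4, "environment_tmp": -3.0, "int_tmp": 5.5}
    print(selected, project(sample, selected))
```

The same selected feature list has to be applied to both the training turbine and the test turbine so that the model sees identical inputs in both stages.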

Conclusions
In this paper, an attention mechanism-based CNN-LSTM model for wind turbine fault prediction is proposed. The model is trained on the icing fault dataset and yaw fault dataset of wind turbines annotated by the SSN ontology. The trained model can accurately predict the occurrence of wind turbine faults. The experimental results show that the model is more effective than some of the current mainstream models. Through the joint training of CNN-LSTM and the introduction of the attention mechanism, the model can effectively learn the deep temporal characteristics among samples and focus on the features most correlated with wind turbine faults. These strategies improve the accuracy and generalization ability of the model. In addition, the generalization ability of the model can be further improved by using the random forest algorithm in the data preprocessing stage. In future work, we will continue to study how to improve the prediction accuracy and generalization ability of the model. We will also investigate how to introduce other machine learning algorithms into the model to adapt it to the complex working environment of wind turbines, so that the model can be applied to practical engineering.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.