Novel Intelligent Method for Fault Diagnosis of Steam Turbine Based on T-SNE and XGBoost


 Because steam turbine faults occur frequently and cause huge losses, it is important to identify the fault category. A steam turbine clustering fault diagnosis method based on t-distributed stochastic neighbor embedding (t-SNE) and extreme gradient boosting (XGBoost) is proposed. Firstly, the t-SNE algorithm is used to map the high-dimensional data to a low-dimensional space, and data clustering is performed in the low-dimensional space. Combined with the fault records of the power plant, the fault data and health data in the clustering result are distinguished. Then, the imbalance in the data is handled by the synthetic minority over-sampling technique (SMOTE) algorithm to obtain a steam turbine feature data set with fault labels. Finally, the XGBoost algorithm is used to solve this multi-classification problem. In the experiments, the method achieved the best performance, with an overall accuracy of 97% and early warning at least two hours in advance. The experimental results show that this method can effectively evaluate the equipment state and provide fault warnings for power plant equipment.


Introduction
Thermal power plays an important role in total power generation. However, thermal power generation consumes enormous amounts of coal, a non-renewable resource, contributing to coal shortages. In order to save energy, reduce pollution, and protect the environment, thermal power enterprises should adopt advanced scientific and technological means to reduce energy-efficiency losses, strengthen research on fault diagnosis of the main power generation equipment (e.g., steam boilers and turbines), and improve the utilization rate of materials and energy to ensure the sustainable development of thermal power enterprises [1][2][3].
In recent years, the rapid development of information technology, computer technology, and other new technologies has brought new progress in the condition monitoring and fault diagnosis of equipment [4][5]. Power plants use big data technology to deeply mine the value of their data [6][7], which makes them more optimized, safer, and more economical. The application of machine learning to intelligent diagnosis has also achieved good results. The main machine learning algorithms include support vector machines (SVM) [8][9] and their improved variants, decision trees and their improved variants [10][11], artificial neural networks (ANN) and their improved variants [12][13], etc. These algorithms can achieve good classification results on data sets with a large number of fault labels. W. Deng et al. [14] used an improved particle swarm optimization (PSO) algorithm to optimize the parameters of least squares support vector machines (LS-SVM) in order to construct an optimal LS-SVM classifier for fault classification. In H. Sun's research [15], a fault diagnosis method based on wavelet packet analysis and support vector machines was proposed: the wavelet packet transform was first used to decompose and denoise the signal, the original fault feature vector was extracted for reconstruction, and an improved support vector machine algorithm was then used to diagnose faults based on the new fault feature vector. G. Ma et al. [16] proposed an intelligent fault diagnosis model for power equipment based on case-based reasoning (IFDCBR). The model discovers the potential rules of equipment faults through data mining.
The steam turbine is one of the most important pieces of equipment in a thermal power plant [17]. Accurate fault diagnosis can find faults in time, allow repairs in advance, and ensure normal production. A thermal power plant collects a large amount of data during steam turbine operation and maintenance, such as condition monitoring data and defect data. The large amount of steam turbine data accumulated in the power plant automation system contains information about the characteristics of steam turbine fault states.
However, because data acquisition is automatic while fault records are kept manually, the fault records cannot be directly associated with the automatically acquired time-series data. More seriously, due to the low efficiency and low quality of manual recording, sample data with a large number of labels cannot be obtained directly. Moreover, the turbine has high reliability and operates normally for long periods, which makes it difficult to obtain a large number of fault samples. Because the signals collected by the automatic system are nonlinear and non-stationary, the fault features are often drowned out by external factors such as noise, and traditional signal processing and analysis techniques are greatly limited. Therefore, an effective method for the feature extraction and fault diagnosis of steam turbine equipment is needed under these conditions.
Based on an analysis of the latest research progress, a fault diagnosis method combining t-distributed stochastic neighbor embedding (t-SNE), K-means clustering, the synthetic minority over-sampling technique (SMOTE), and extreme gradient boosting (XGBoost) is proposed in this paper. Since the vibration signals collected by the samplers have high dimensionality and cannot be visualized directly, t-SNE is used for dimensionality reduction. Since the data collected in the actual project are unlabeled, K-means clustering, combined with the fault records, is used for fault identification. When an early fault of the steam turbine is detected by the model proposed in this paper, the warning information is fed back to the thermal power plant owners (or managers, or operators). This early warning of incipient faults gives them time to deal with the problem; during this time, they can use other methods to reasonably arrange when to take measures.
The remaining part of this paper is organized as follows. Section 2 is about methods. In Section 2.1, the performance indicator extraction based on t-SNE and K-means are introduced and discussed. Section 2.2 introduces the imbalanced data recognition model based on SMOTE and XGBoost. A model assessment method is introduced in Section 2.3. Section 3 provides the data experiment and results of the proposed method. Finally, conclusions are addressed in Section 4.

Performance indicator extraction based on t-SNE and K-means
The t-SNE is a nonlinear dimensionality reduction algorithm that maps high-dimensional data to a two- or three-dimensional space based on the similarity of the high-dimensional data points [18][19]. It has been applied in many fields, including image processing [20], genetics [21], and materials science [22]. In this paper, the input of t-SNE consists of the signal features extracted by the data acquisition equipment. According to the similarity of these signal features, their dimensionality is further reduced. The main algorithm is as follows.
(1) The conditional probability distribution in the high-dimensional space is calculated as

p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}

where \sigma_i is the variance of the Gaussian distribution centered on x_i.
(2) The joint probability density p_{ij} of the high-dimensional samples is calculated as

p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

(3) The joint probability density q_{ij} of the low-dimensional samples is calculated as

q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}

(4) The loss function C and its gradient are calculated. C is defined by the Kullback-Leibler (KL) divergence to evaluate the similarity of the joint probability densities p_{ij} and q_{ij}:

C = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}

\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \|y_i - y_j\|^2)^{-1}

(5) The low-dimensional embedding Y is updated by gradient descent with momentum:

Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \alpha(t)\left(Y^{(t-1)} - Y^{(t-2)}\right)

where t is the number of iterations, \eta is the learning rate, and \alpha(t) is the momentum factor.
(6) Steps (4) and (5) are repeated until the maximum number of iterations is reached.
After obtaining the low-dimensional data output by the t-SNE algorithm, the K-means clustering algorithm [23] is used to divide the data into two categories, identifying the fault danger interval. Once a single fault danger interval is identified, the data are divided into fault data and normal data. However, the scarcity of failure records leads to an imbalance problem.
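The dimensionality-reduction and clustering step above can be sketched as follows. This is a minimal illustration using scikit-learn's `TSNE` and `KMeans` as stand-ins for the paper's implementation; the synthetic 34-dimensional data below merely mimic the shape of the steam turbine measurements and are not the paper's data.

```python
# Sketch: map 34-dimensional samples to 2 dimensions with t-SNE,
# then cluster the low-dimensional points into two groups
# (fault vs. health). Data are synthetic and purely illustrative.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 1.0, size=(60, 34))   # "normal" operation
faulty = rng.normal(4.0, 1.0, size=(20, 34))    # "fault" operation
X = np.vstack([healthy, faulty])

# Nonlinear dimensionality reduction to 2-D.
X_2d = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)

# Two-class clustering of the embedded points.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)
print(X_2d.shape, np.unique(labels))
```

In the paper's setting the cluster containing the interval around a recorded fault would then be labeled as fault data, and the remaining cluster as health data.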

Imbalanced data recognition model based on SMOTE and XGBoost
Sampling methods are very popular in balancing the class distribution. Over and under-sampling methodologies have received significant attention to counter the effect of imbalanced datasets. SMOTE algorithm is simple and efficient, has a good anti-noise ability, and can improve the generalization of the model [24][25]. The formal procedure works as follows.
The minority class is over-sampled by taking each minority class sample and introducing synthetic examples along the line segments joining any/all of the k minority class nearest neighbors. Depending upon the amount of over-sampling required, neighbors from the k nearest neighbors are randomly chosen. Synthetic samples are generated in the following way: Take the difference between the feature vector (sample) under consideration and its nearest neighbor. Multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration. This causes the selection of a random point along the line segment between two specific features. This approach effectively forces the decision region of the minority class to become more general [26].
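The interpolation step described above can be sketched in a few lines of NumPy. This is a simplified illustration of the SMOTE idea, not the exact implementation used in the paper (the experiments may instead rely on a library implementation); the function name and data are hypothetical.

```python
# Minimal SMOTE sketch: synthesize minority samples by interpolating
# between each chosen sample and one of its k nearest minority
# neighbors, with a random interpolation factor in [0, 1).
import numpy as np

def smote_samples(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from minority class X_min."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-distance
    nn = np.argsort(d, axis=1)[:, :k]    # k nearest neighbors of each sample
    out = []
    for _ in range(n_new):
        i = rng.integers(n)              # pick a minority sample
        j = nn[i, rng.integers(k)]       # pick one of its k neighbors
        gap = rng.random()               # random factor in [0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(out)

X_min = np.random.default_rng(1).normal(size=(20, 34))
X_syn = smote_samples(X_min, n_new=40)
print(X_syn.shape)
```

Each synthetic point lies on the segment between two real minority samples, which is what makes the minority decision region more general rather than simply duplicating existing points.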
Boosting is a machine learning technique that can be used for regression and classification problems. It generates a weak learner at each step and accumulates it into the overall model. If the weak learner at each step is fitted along the gradient direction of the loss function, the method is called Gradient Boosting Decision Tree (GBDT) [27]. The difference is that GBDT uses only the first derivative of the loss function when calculating the objective function, whereas XGBoost approximates the loss function with a second-order Taylor expansion. The main algorithm is as follows.
Assume a dataset D = \{(x_i, y_i)\}, x_i \in \mathbb{R}^m, y_i \in \mathbb{R}; then we have n observations with m features each and a corresponding target variable y. Let \hat{y} be the result given by an ensemble represented by the generalized model

\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)   (8)

where each f_k is a regression tree, and f_k(x_i) represents the score given by the k-th tree to the i-th observation. In order to learn the set of functions f_k, the following regularized objective function should be minimized:

L = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k)   (9)

where l is the loss function. To keep the complexity of the model from becoming too large, the penalty term \Omega is included:

\Omega(f) = \gamma T + \frac{1}{2}\lambda \|w\|^2   (10)

where \gamma and \lambda are parameters controlling the penalty for the number of leaves T and the magnitude of the leaf weights w, respectively. The purpose of \Omega is to prevent over-fitting and to simplify the models produced by this algorithm.
An iterative method is used to minimize the objective function. The objective function minimized in the j-th iteration to add f_j is

L^{(j)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(j-1)} + f_j(x_i)\big) + \Omega(f_j)   (11)

Eq. (11) can be simplified by using a second-order Taylor expansion, and a formula can be derived for the loss reduction after a tree split from a given node:

L_{split} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma   (12)

where G = \sum_{i \in I} g_i and H = \sum_{i \in I} h_i, I is the subset of available observations in the current node (with I_L and I_R the subsets in the left and right child nodes), and g_i and h_i are the first and second derivatives of the loss function. The XGBoost algorithm has many advantages: it prevents over-fitting by penalizing model complexity in the loss function, optimizes the number of iterations through cross-validation, and greatly improves the computational efficiency of the model through parallel processing. The algorithm is implemented in the "xgboost" package for the Python language, provided by the creators of the algorithm.
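The split-gain formula can be evaluated numerically as a sanity check. The sketch below assumes a squared-error loss, for which g_i is the residual and h_i = 1; the gradient values are illustrative, not drawn from the paper's data.

```python
# Numeric sketch of the XGBoost split-gain formula: loss reduction
# from splitting a node into left/right children, given the sums of
# first (G) and second (H) derivatives in each child.
import numpy as np

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain = 1/2 [G_L^2/(H_L+lam) + G_R^2/(H_R+lam)
                   - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma."""
    GL, HL = g_left.sum(), h_left.sum()
    GR, HR = g_right.sum(), h_right.sum()
    return 0.5 * (GL**2 / (HL + lam)
                  + GR**2 / (HR + lam)
                  - (GL + GR)**2 / (HL + HR + lam)) - gamma

# First derivatives (residuals) for a candidate left/right partition;
# for squared error the second derivative is identically 1.
g = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
h = np.ones_like(g)
gain = split_gain(g[:3], h[:3], g[3:], h[3:])
print(gain)
```

A partition that separates negative from positive residuals, as here, yields a large positive gain; splitting randomly mixed residuals would yield a gain near or below zero, and splits with gain below \gamma are pruned.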

Model Assessment Method
The confusion matrix [28] is a classical method for evaluating the results of classification models:

N = (N_{ij})

where N_{ij} represents the probability that class i is classified as class j on the validation set.
Accuracy, precision, recall, and the F1-score [28] all play a role in the evaluation of a classification model. Through these complementary evaluation indexes, combined with the results of the confusion matrix, the algorithm model can be evaluated, optimized, and screened, and the optimal model for the data can be obtained. Accuracy is the ratio of the number of correct predictions on the test set to the total number of samples:

Accuracy = \frac{\sum_i N_{ii}}{\sum_i \sum_j N_{ij}}

The precision of class i is the ratio between the number of samples correctly predicted as class i and the total number of samples predicted as class i on the test set:

P_i = \frac{N_{ii}}{\sum_j N_{ji}}

The recall of class i is the ratio between the number of samples correctly predicted as class i and the total number of samples that actually belong to class i:

R_i = \frac{N_{ii}}{\sum_j N_{ij}}

The F1-score is calculated from precision and recall; because these two values alone are not intuitive enough, combining them gives a more intuitive measure, where a larger value indicates a better result:

F1 = \frac{2PR}{P + R}

where P is the precision and R is the recall. Through the precision, recall, and F1-score, which reflect the quality of the classification results, we can adjust and optimize the classification model.
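These metrics can be computed directly from a confusion matrix. The sketch below assumes rows index the true class and columns the predicted class; the matrix values are illustrative, not the paper's results.

```python
# Sketch: accuracy, per-class precision/recall, and F1-score
# computed from a confusion matrix N (rows = true, cols = predicted).
import numpy as np

N = np.array([[50,  2,  1],
              [ 3, 45,  0],
              [ 1,  1, 47]])

accuracy = np.trace(N) / N.sum()        # correct over all samples
precision = np.diag(N) / N.sum(axis=0)  # correct / predicted as class i
recall = np.diag(N) / N.sum(axis=1)     # correct / actually class i
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision.round(3), recall.round(3), f1.round(3))
```

Note that precision normalizes over a column (all samples predicted as class i) while recall normalizes over a row (all samples truly in class i); confusing the two axes is a common implementation error.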
For a better grasp of the proposed fault diagnosis of the steam turbine process, we summarize the main procedures as follows.
Step 1, Performance indicator extraction. The t-SNE algorithm is used to reduce dimensions. Then, cluster analysis is performed on the low-dimensional data.
Combined with the fault records, the fault data and health data of the clustering result are distinguished.
Step 2, Imbalanced data recognition model. The imbalance problem in the data is processed by the SMOTE algorithm. We used the XGBoost algorithm to solve this multiclassification problem.
Step 3, Model Assessment Method. The confusion matrix is used to evaluate the results of the classification model. Figure 1 shows a schematic diagram of the proposed fault diagnosis process for the steam turbine.

Data Sources and Introduction
The data set in this paper is derived from the time-series data of a steam turbine at a thermal power plant in China. After excluding measurement points with severe data loss or no data records, the steam turbine data contain 35 variables, such as the time stamp, operating-condition parameters, and status parameters. The acquisition period was 8 months, with a sample size of about 340,000. Table 1 shows the statistical information of the steam turbine data set; more detailed information on the data set can be found in Table 1 of the Appendix. The fault information comes from the fault records kept manually by the power plant, including the fault content and recording time. Because some fault records are not equipment operation faults and cannot be effectively identified from the data, the available fault information needed to be screened. The fault records selected for use are shown in Table 2.

Setting labels for different faults or normal
We use the fault discovery time in the data record to determine when each fault occurred. The data from 8-24 hours before and after each fault record of the steam turbine are intercepted for analysis. Firstly, the t-SNE algorithm described in the previous section is used to map the 34-dimensional data to a 2-dimensional space. Then, the K-means clustering method is used to separate the fault data from the health data. The results are visualized in Figure 2: green data points represent health data, and the other colors represent the different types of fault data. Table 3 shows the information on the five types of failure data. Compared with the times in the fault records, it can be seen that this method can distinguish the fault data and health data of the steam turbine, and that it provides a certain amount of advance warning.

Dealing with data imbalance
After labeling the data, the next problem to solve is data imbalance. The total number of fault samples is 5,118 and the number of normal samples is 335,350; the ratio of normal data to fault data is therefore about 67:1, a very high imbalance that must be addressed before building a classification model. Applying imbalance processing directly to the full data set would also introduce a lot of noise, affecting the accuracy of the subsequent classification algorithms.
The normal data were therefore sampled in sections: the data of one day out of every four were extracted and reassembled into a new normal data set. The SMOTE algorithm was then applied to the newly formed data to deal with the remaining imbalance, and the resulting sample data set is shown in Table 4.

Test Results
After addressing the data imbalance in the previous section, the XGBoost algorithm can be used for fault diagnosis. The data set was divided into a training set and a test set according to the ratio of 3:7. The results of the confusion matrix are shown in Table 5. To better quantify the performance of the model, the precision, recall, and F1-score were calculated; the results are shown in Table 6. Table 5 and Table 6 show the classification results of the model for the five fault types of the steam turbine. As is well known, when precision, recall, and F1-score all have high values at the same time, the model is considered to perform well, other factors aside. The model based on the XGBoost classifier has high accuracy in the fault diagnosis of steam turbines and can identify the different types of faults.
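The classification step can be sketched on synthetic data as follows. To keep the example self-contained, scikit-learn's `GradientBoostingClassifier` is used here as a stand-in for the paper's XGBoost model (the `xgboost` package exposes a very similar `XGBClassifier` interface); the three well-separated synthetic classes are illustrative, not the paper's fault data.

```python
# Sketch of the final classification step on synthetic 34-feature
# data: split into training and test sets (3:7, per the text),
# fit a gradient-boosted tree classifier, and score it.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(0)
# One "normal" class and two "fault" classes with shifted means.
X = np.vstack([rng.normal(c, 1.0, size=(120, 34)) for c in (0.0, 3.0, -3.0)])
y = np.repeat([0, 1, 2], 120)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.3, stratify=y, random_state=0)  # 3:7 train/test

clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

acc = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)
print(cm.shape)
```

From the confusion matrix `cm`, the per-class precision, recall, and F1-score of Section 2.3 can then be computed to fill tables like Table 5 and Table 6.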

Conclusions
To detect the early failure of steam turbines, a model based on t-SNE and XGBoost was proposed. The model's high accuracy was verified with data from steam turbine units of thermal power plants in China.
(1) The uncertainty problem of feature extraction from the unlabeled data set was solved using t-SNE and K-means. This method can distinguish fault data from health data, and it provides a certain amount of advance warning, because the time at which it identifies a fault is earlier than the manually inspected fault record; this makes it more suitable for practical application in the fault diagnosis of steam turbines.
(2) The problem of data imbalance caused by the scarcity of fault records was solved by using the SMOTE algorithm, which is of great significance for the fault diagnosis of steam turbines and other mechanical equipment with few fault samples.
(3) In the identification of new data, the accuracy and other indicators of the model based on XGBoost reached more than 96%, which shows that this method has high value in turbine fault diagnosis.