Simultaneous-Fault Diagnosis of Gas Turbine Generator Systems Using a Pairwise-Coupled Probabilistic Classifier

A reliable fault diagnostic system for a gas turbine generator system (GTGS), which is complicated and subject to many types of component faults, is essential to avoid interruption of the electricity supply. However, GTGS diagnosis faces two challenges: the existence of simultaneous faults, and the high cost of acquiring the exponentially increasing number of simultaneous-fault vibration signals needed to construct the diagnostic system. This research proposes a new diagnostic framework combining feature extraction, a pairwise-coupled probabilistic classifier, and decision threshold optimization. The feature extraction module adopts wavelet packet transform and time-domain statistical features to extract vibration signal features. Kernel principal component analysis is then applied to further reduce the redundant features. The features of single faults in a simultaneous-fault pattern are extracted and then detected using a probabilistic classifier, namely the pairwise-coupled relevance vector machine, which is trained with single-fault patterns only. Therefore, a training dataset of simultaneous-fault patterns is unnecessary. To optimize the decision threshold, this research proposes the grid search method, which can ensure a global solution as compared with traditional computational intelligence techniques. Experimental results show that the proposed framework performs well for both single-fault and simultaneous-fault diagnosis and is superior to frameworks without feature extraction and pairwise coupling.


Introduction
The gas turbine generator system (GTGS) is commonly used in many power plants. The main components of a GTGS are the power turbine, gearbox, flywheel, and asynchronous generator. In the first phase of the GTGS, the power turbine is driven by the exhaust gas; the output of the turbine then drives a gearbox connected to the flywheel, which maintains a constant moment of inertia to protect the generator from a sudden stop. Finally, the rotating flywheel drives the asynchronous generator to generate electric power. The system is designed to run 24 hours per day. Any abnormal situation of the GTGS will interrupt the electricity supply and cause enormous economic loss. Traditional manual inspection of the GTGS can hardly accomplish the fault-monitoring task because the GTGS is complicated and more than one fault may appear at a time. This kind of problem is referred to as simultaneous-fault diagnosis. To prevent interruption of the electricity supply, the development of an intelligent fault diagnostic system for single and simultaneous faults of the GTGS is a promising research topic.
According to the existing literature, no physical diagnostic model for simultaneous-fault diagnosis of a power generator system such as the GTGS is available. One possible solution for intelligent fault detection is to use machine learning methods, such as neural networks (NNs) and support vector machines, to learn the normal and abnormal vibration signal patterns. A classifier is then built for fault detection of unseen signal patterns.
In recent years, research on the development of neural-network-based monitoring systems for the GTGS and rotating machinery has been reported [1-4]. However, neural network classifiers have many drawbacks, such as local minima, time-consuming determination of the optimal network structure, and risk of overfitting. To date, a number of researchers have applied support vector machines (SVMs) to diagnose rotating machine faults and other engineering diagnosis problems [5-13] and have shown that SVM is superior to traditional NNs [8, 10, 11]. The major advantages of SVM are its global optimum and higher generalization capability [8, 11]. Nevertheless, those classification systems can detect single faults only. Even though simultaneous faults can also be detected under the concept of single-label classification, in which both single and simultaneous faults are treated as independent labels, this requires many simultaneous-fault training patterns. The transformation of simultaneous-fault diagnosis into single-label classification suffers from two drawbacks. Firstly, the transformation significantly increases the required amount of training data because a set of new classes is artificially generated from combinations of single faults, which makes it impractical for industrial problems with a medium to large number of faults. For example, with N single faults (labels), there are 2^N − (N + 1) artificial simultaneous-fault labels, and each label requires a certain amount of training data. This is impractical and prohibitive because acquiring all possible simultaneous-fault patterns is costly and difficult. Secondly, it is very difficult to add new faults to the system later, because the addition of one single fault increases the number of combinations of single faults for constructing the artificial simultaneous-fault labels exponentially, requiring a correspondingly huge amount of training patterns
too. Another way to solve the simultaneous-fault diagnosis problem is the hierarchical artificial neural network (HANN) framework, which is constructed from several stages of NNs. Each HANN stage is built from a set of NNs, as described in [2, 14-16]. HANN also suffers from the shortcoming that its architecture becomes too complex to handle when extended to a medium or large scale: as the number of single faults increases, a large number of NNs must be constructed in every HANN stage.
To overcome the aforesaid limitations, one possible way is to develop a classifier that can diagnose both single and simultaneous faults while being trained with single-fault data only. This idea is referred to as multilabel classification. It is believed, in the domain of the GTGS, that the features of single-fault signal patterns can be found within simultaneous-fault patterns if a proper feature extraction technique is selected. Should this hypothesis be true, it is possible to develop such a classifier using proper feature extraction techniques. Therefore, one of the original contributions of this research is to explore the feasibility of detecting simultaneous-fault vibration patterns of rotary machinery with a classifier trained only on single-fault patterns. To examine this feasibility, an experimental evaluation is carried out in this study.
As far as feature extraction is concerned, several techniques are currently available for fault detection of rotary machinery, including wavelet packet transform (WPT) [4, 8, 17], independent component analysis [6], and time-domain statistical features (TDSF) [13, 18]. A review of the literature shows that WPT and TDSF are commonly used for rotating machinery, so both techniques are considered in this research. After feature extraction, there may still be irrelevant and redundant information in the extracted features. It is well known that if the number of inputs to a fault classifier is too large, its accuracy degrades unless a huge amount of training data is available, which is impractical. To resolve this problem, a feature selection method should be employed to remove irrelevant and redundant information so that the number of classifier inputs can be reduced, improving diagnostic accuracy. Currently, several feature selection approaches are available, including the compensation distance evaluation technique (CDET) [18], kernel principal component analysis (KPCA) [19], and genetic algorithm (GA)-based methods [10, 20]. Although CDET and GA-based methods can provide good solutions, the optimal threshold in CDET is difficult to set, and the result of a GA is unrepeatable; in other words, running a GA twice will produce two different results. For these reasons, KPCA is adopted in this study to reduce the dimensionality of the information content. The term feature extraction hereafter refers to both feature extraction and feature selection.
From a practical point of view, a proper classifier should provide the probabilities of all possible faults. The user can then at least trace the other possible faults according to the rank of their probabilities when the fault(s) predicted by the classifier is (are) incorrect. Moreover, it is our belief that comparing the similarity of a simultaneous-fault input pattern with every single-fault pattern is the main mechanism for simultaneous-fault detection without learning from any simultaneous-fault training data. One possible way to represent the degree of similarity between two patterns is the probability of likeness. Therefore, it is logical to employ probabilistic classifiers for simultaneous-fault diagnosis because they can determine the probability of each label. Typically, the probabilistic neural network (PNN) [21, 22] has been employed as a probabilistic classifier. It was shown in [21] that the performance of PNN is superior to an SVM-based method for multilabel classification. However, the main drawback of PNN lies in the limited number of inputs, because the complexity of the network and the training time are heavily related to the number of inputs. Recently, Widodo et al. [23] proposed applying an advanced classifier, namely the relevance vector machine (RVM), to the fault diagnosis of low-speed bearings. They showed that RVM is superior to SVM in terms of diagnostic accuracy. RVM is a statistical learning method proposed by Tipping [24], which trains a probabilistic classifier with a sparser model using a Bayesian framework. RVM can be extended to a multiclass version using the one-versus-all (1vA) strategy. However, this strategy has been shown to produce a large region of indecision [25, 26]. In view of this drawback, this research proposes to incorporate pairwise coupling, that is, the one-versus-one (1v1) strategy, into RVM, yielding the pairwise-coupled relevance vector machine (PCRVM). As the pairwise coupling strategy considers the correlation between every pair of fault labels, a more accurate estimate of label probabilities for simultaneous-fault signals can be achieved. A detailed explanation of the advantage of 1v1 over 1vA is given in Section 2.3.
If probabilistic classification is applied to fault detection, the predicted fault is usually inferred as the one with the largest probability. An alternative approach is for the probabilistic classifier to rank all possible faults according to their probabilities and let the engineer make a decision. These inference approaches work fine for single-fault detection but fail to determine which faults occur simultaneously in the simultaneous-fault problem, because the engineer cannot identify the number of simultaneous faults from the output probability of each label. For instance, suppose an output probability vector for five labels is [0.49, 0.5, 0.68, 0.1, 0.7]. In this example, it is difficult for the engineer to judge whether the simultaneous faults are labels 2, 3, and 5. To identify the number of simultaneous faults, a decision threshold must be introduced, and thus a decision threshold optimization step is included in the current framework in addition to feature extraction and probabilistic classification.
In summary, a framework combining feature extraction (WPT + TDSF + KPCA), the pairwise-coupled relevance vector machine (PCRVM), and effective decision threshold optimization is proposed for simultaneous-fault diagnosis of the GTGS. Although the authors previously proposed a similar framework for simultaneous-fault diagnosis of automotive engine ignition systems [27], that framework was designed for simple waveforms, whereas the signal patterns in this application are high-frequency oscillating signals. Moreover, one of the classification criteria and the decision threshold optimization method in [27] rely, respectively, on domain-specific knowledge and on computational intelligence techniques, such as the genetic algorithm (GA) and particle swarm optimization (PSO), which may produce a locally optimal solution. Therefore, the framework in [27] cannot be directly applied; it must be modified significantly, particularly in the feature extraction and threshold optimization phases.
This paper is organized as follows. Section 2 presents the proposed framework and the related techniques. The experimental setup and sample data acquisition are discussed in Section 3. Section 4 discusses the experimental results of PCRVM and its comparison with a typical PNN [21, 22], an existing single-label SVM classifier [28], and the pairwise-coupled probabilistic neural network (PCPNN), in order to show the superiority of RVM and the effectiveness of the pairwise coupling strategy. Finally, conclusions are given in Section 5.

Proposed Simultaneous-Fault Diagnosis Framework
The proposed simultaneous-fault diagnostic framework for the GTGS and its evaluation approach are shown in Figure 1. The diagnostic model, using a pairwise-coupled probabilistic classifier (PCPNN or PCRVM), is trained using the processed training dataset D_Proc-Train. The output of the trained classifier, together with the processed validation dataset D_Proc-Vali, is used to optimize the parameters of the probabilistic classifier (i.e., the spread of PCPNN or the width of PCRVM) and the decision threshold. The optimized classifier parameter is used to construct the final PCRVM- or PCPNN-based diagnostic model, and the optimal threshold is used to identify simultaneous faults in the unseen test dataset D_Proc-Test. The performance evaluation submodule adopts the F-measure to assess the results on D_Proc-Test. The details of the four submodules of the framework are discussed in the following subsections.

Wavelet Packet Transform.
In the last two decades, the wavelet transform (WT) has been widely applied in random signal processing. The transform of a signal is just another form of representing the signal; it does not change the information content of the signal. In WT, a multiresolution technique is used in which different frequencies are analyzed with different resolutions in order to provide a time-frequency representation of the signal. The wavelet packet transform (WPT) derives from the WT family. WPT is a generalization of wavelet decomposition that offers a richer signal analysis. In the decomposition of a signal by the discrete wavelet transform (DWT), only the lower frequency band is decomposed further, giving a right-recursive binary tree structure whose right lobe represents the lower frequency band and whose left lobe is the higher frequency band. In the corresponding WPT decomposition, both the lower and the higher frequency bands are decomposed, giving a balanced binary tree structure. Therefore, WPT has the same frequency bandwidth in each resolution, while the traditional DWT does not have this property. This makes WPT suitable for processing nonstationary signals, such as vibration signals, because the equal frequency bandwidths provide good resolution at both high and low frequencies.
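As a concrete illustration of the balanced-tree decomposition described above, the following is a minimal sketch of a wavelet packet transform. The Haar filter pair is chosen here only for brevity; the paper itself uses Daubechies wavelets (Db3-Db5).

```python
# Minimal sketch of wavelet packet decomposition with the Haar wavelet.
# Unlike the plain DWT, BOTH the low-pass and high-pass sub-bands are
# decomposed further, giving 2^L equal-width sub-bands after L levels.
import math

def haar_split(x):
    """Split a signal into (low-pass, high-pass) half-length sub-bands."""
    low  = [(x[i] + x[i + 1]) / math.sqrt(2) for i in range(0, len(x) - 1, 2)]
    high = [(x[i] - x[i + 1]) / math.sqrt(2) for i in range(0, len(x) - 1, 2)]
    return low, high

def wpt(x, level):
    """Full wavelet packet tree: returns the 2**level leaf sub-bands."""
    bands = [x]
    for _ in range(level):
        nxt = []
        for b in bands:
            low, high = haar_split(b)
            nxt.extend([low, high])   # decompose BOTH halves (WPT, not DWT)
        bands = nxt
    return bands

signal = [float(i % 8) for i in range(64)]
subbands = wpt(signal, 3)
print(len(subbands), len(subbands[0]))   # 2^3 = 8 sub-bands of 64/8 = 8 coeffs
```

Because the Haar filters are orthonormal, the total signal energy is preserved across the leaves, which is why sub-band statistics remain informative.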

Kernel Principal Component Analysis.
Principal component analysis (PCA) is a popular statistical method for principal component feature extraction. PCA performs well in dimensionality reduction when the input variables are linearly correlated. However, for nonlinear cases, PCA does not give good performance. Hence, PCA has been extended to a nonlinear version under a kernel formulation, called kernel PCA (KPCA), which has been used to solve many application problems [29, 30]. KPCA involves solving for the eigenvectors α in the following eigenvalue problem:

Nλα = Ωα,

where Ω_{i,j} = K(x_i^WT, x_j^WT) for i, j = 1, ..., N; N is the number of data points for KPCA; x_i^WT ∈ R^d; and d is the dimension of the input data. The vector α = [α_1, ..., α_N]^T is an eigenvector of Ω, and λ ∈ R is the corresponding eigenvalue. The transformed variables (score variables) z_k for an input vector x^WT become

z_k = Σ_{i=1}^{N} α_{k,i} K(x_i^WT, x^WT),  k = 1, ..., p,

where α_{k,i} is the ith element of the eigenvector α_k corresponding to the kth largest eigenvalue, and p is the largest number such that the eigenvalue λ_p of the eigenvector α_p is nonzero. Therefore, based on the p pairs of (λ_k, α_k), the input vector x^WT ∈ R^d can be transformed into nonlinearly uncorrelated variables z = [z_1, ..., z_p], where d > p. One more point to note is that each eigenvector α_k should satisfy the normalization condition of unit length:

λ_k (α_k · α_k) = 1.

To produce a further reduced feature vector, a post-pruning procedure can be applied. That is, after acquiring the p pairs of (λ_k, α_k), all λ_k are normalized to λ'_k so that they satisfy the constraint Σ_{k=1}^{p} λ'_k = 1. Based on these normalized eigenvalues λ'_k, the smallest λ'_k are deleted as long as the retained ones satisfy Σ_{k=1}^{u} λ'_k ≥ 0.95. With the index u < p, the eigenvectors α_k (k = 1 to u) are selected to produce a reduced feature vector that retains 95% of the information content of the transformed features. A 5% information loss is a common rule of thumb for dimensionality reduction.
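The KPCA reduction described above can be sketched as follows. The polynomial kernel and the 95% eigenvalue cut-off follow the text; the kernel-centring step is standard KPCA practice, and the random data are purely illustrative.

```python
# Sketch of KPCA feature reduction: build the kernel matrix, solve its
# eigenproblem, keep the leading components whose normalized eigenvalues
# accumulate to 95%, and return the score variables z_k.
import numpy as np

def kpca_reduce(X, degree=4, retain=0.95):
    """Return score variables keeping 95% of the eigenvalue mass."""
    N = X.shape[0]
    K = (X @ X.T + 1.0) ** degree                   # polynomial kernel Omega
    one = np.full((N, N), 1.0 / N)
    Kc = K - one @ K - K @ one + one @ K @ one      # centre in feature space
    lam, A = np.linalg.eigh(Kc)                     # ascending eigenvalues
    lam, A = np.clip(lam[::-1], 0.0, None), A[:, ::-1]   # descending order
    ratios = lam / lam.sum()
    u = int(np.searchsorted(np.cumsum(ratios), retain)) + 1  # retain 95%
    A = A[:, :u] / np.sqrt(np.maximum(lam[:u], 1e-12))  # lam_k(a_k.a_k) = 1
    return Kc @ A                                   # score variables z

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 6))
Z = kpca_reduce(X)
print(Z.shape)   # 40 samples, u < 40 retained components
```

The eigenvector rescaling enforces the unit-length condition λ_k(α_k·α_k) = 1 stated above.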

Relevance Vector Machine. The relevance vector machine (RVM) is a recently developed machine learning method.
Theoretically, RVM is a statistical learning method utilizing the Bayesian learning framework and popular kernels. In this research, RVM predicts the posterior probability of each fault label for unseen symptoms f based on experimental data. Given a set of training data (F, t) = {f_n, t_n}, n = 1 to N, t_n ∈ {0, 1}, where N is the number of training data, RVM follows statistical convention and generalizes the linear model y(f) = Σ_{n=1}^{N} w_n K(f, f_n) + w_0 by applying the logistic sigmoid function σ(y(f)) = 1/(1 + exp(−y(f))) to the predicted decision y(f) and adopting the Bernoulli distribution for P(t | F). The likelihood of the data is then written as [24]

P(t | F, w) = Π_{n=1}^{N} σ{y(f_n)}^{t_n} [1 − σ{y(f_n)}]^{1−t_n},

where w = (w_0, w_1, ..., w_N)^T are the adjustable parameters and K(·, ·) is a radial basis function (RBF), since the RBF kernel is usually adopted for classification problems.
The optimal weight vector w for the given dataset is computed so as to maximize the probability P(w | t, F, α) ∝ P(t | F, w)P(w | α), with α = [α_0, α_1, ..., α_N] a vector of N + 1 hyperparameters. However, the weights cannot be determined analytically. Thus, the following approximation procedure, based on Laplace's method, is adopted.
(c) The hyperparameter vector α is updated using an iterative re-estimation equation. Firstly, randomly guess α_i, and calculate γ_i = 1 − α_i Σ_ii, where Σ_ii is the ith diagonal element of the covariance matrix Σ. Then re-estimate α_i as α_i^new = γ_i / μ_i^2, where μ_i is the ith element of the posterior mean weight vector. (Figure 3 illustrates the decision boundaries constructed using the one-versus-all and pairwise coupling strategies.)
Several methods for the pairwise coupling strategy are available [25]; however, they are unsuitable for simultaneous-fault diagnosis because of the constraint Σ_i P_i = 1. Note that, by the nature of simultaneous-fault diagnosis, Σ_i P_i need not equal 1. Therefore, the following simple pairwise coupling strategy for simultaneous-fault diagnosis is proposed. Every P_i is calculated as

P_i = Σ_{j≠i} n_ij r_ij / Σ_{j≠i} n_ij,

where r_ij is the pairwise probability, produced by the binary classifier for the (i, j) label pair, that f belongs to the ith label rather than the jth, and n_ij is the number of training feature vectors carrying either the ith or the jth label. Hence, the probability vector P = P_class(f) can be estimated more accurately because the pairwise correlation between the labels is taken into account.
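The coupling rule can be sketched in a few lines; the pairwise probabilities r_ij and pair counts n_ij below are made-up values for illustration only.

```python
# Sketch of the simple pairwise-coupling rule: each label's probability is
# a weighted average of the 1v1 pairwise probabilities r_ij, weighted by
# n_ij, the number of training vectors carrying either label of the pair.

def couple(r, n):
    """r[i][j]: prob. of label i vs j from the (i, j) binary classifier;
    n[i][j]: number of training vectors with either label i or j."""
    N = len(r)
    P = []
    for i in range(N):
        num = sum(n[i][j] * r[i][j] for j in range(N) if j != i)
        den = sum(n[i][j] for j in range(N) if j != i)
        P.append(num / den)
    return P   # note: sum(P) need not equal 1, which suits simultaneous faults

r = [[0.0, 0.9, 0.8], [0.1, 0.0, 0.7], [0.2, 0.3, 0.0]]
n = [[0, 400, 400], [400, 0, 400], [400, 400, 0]]
print([round(p, 2) for p in couple(r, n)])   # -> [0.85, 0.4, 0.25]
```

Unlike coupling schemes that normalize the outputs to sum to one, this rule leaves each P_i free, so several labels can simultaneously receive high probability.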
With this pairwise coupling strategy, the proposed PCRVM framework can estimate the probability vector P with high accuracy and, hence, achieve higher classification accuracy for simultaneous-fault diagnosis.

Decision Threshold.
The pairwise-coupled probabilistic classifier produces a probability vector P = [P_1, P_2, ..., P_i, ..., P_N], where N is the number of single-fault labels, indicating the probabilistic occurrence of every single fault. This probability vector P can be provided to the user as a quantitative measure for reference and further processing. Practically, it is desirable to obtain the multilabel decision as a decision vector y = [y_1, y_2, ..., y_i, ..., y_N] with every y_i ∈ {0, 1}, which is, however, not directly available from the probability vector P. By applying a decision threshold ε, y can be derived from P, that is, y = DT(P) = [h(P_1), h(P_2), ..., h(P_i), ..., h(P_N)], where

h(P_i) = 1 if P_i ≥ ε, and h(P_i) = 0 otherwise,

ε ∈ [0, 1] is the decision threshold, and y_i indicates whether f belongs to the ith label, i = 1 to N. For example, if ε = 0.5 and P = P_class(f) = [0.62, 0.71, 0.35, 0.46, 0.85], then y = DT(P) = [1, 1, 0, 0, 1], and f is diagnosed as the simultaneous fault (1, 2, 5). Note that the decision threshold ε is the major factor affecting the classification accuracy; it is domain specific, and the diagnostic accuracy is sensitive to it. The diagnosis of the GTGS therefore requires an efficient search algorithm for the optimal setting of the decision threshold. In this study, a simple and effective method, grid search (GS), is adopted. The search region for the decision threshold is set within the range 0 to 1. By applying a reasonably small interval, say 0.01, a series of candidate thresholds is generated. After evaluating the performance index (F-measure) of the various candidate thresholds on the validation dataset, the best threshold ε_opt is obtained. Even though the GS method is time-consuming, it can ensure a global solution if the grid covers the whole search space. As a result, GS will not become stuck in a locally optimal solution, unlike computational intelligence techniques such as the GA and PSO in [27, 32]. Fortunately, the threshold lies between 0 and 1, so the search time is not very long.
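The thresholding step and the grid search over ε can be sketched as follows, using a standard per-case multilabel F1 as the fitness; the probability vectors and true labels are illustrative only.

```python
# Sketch of decision thresholding plus exhaustive grid search over epsilon.
# f1 here is the standard per-case multilabel F-measure used as fitness.

def decide(P, eps):
    """Binarize a probability vector with threshold eps: y_i = 1 iff P_i >= eps."""
    return [1 if p >= eps else 0 for p in P]

def f1(y_pred, y_true):
    inter = sum(min(a, b) for a, b in zip(y_pred, y_true))
    total = sum(y_pred) + sum(y_true)
    return 2.0 * inter / total if total else 1.0

def grid_search(prob_vectors, true_vectors, step=0.01):
    """Exhaustive scan of [0, 1] -> global optimum over the grid."""
    best_eps, best_f = 0.0, -1.0
    eps = 0.0
    while eps <= 1.0:
        f = sum(f1(decide(P, eps), t)
                for P, t in zip(prob_vectors, true_vectors)) / len(true_vectors)
        if f > best_f:
            best_eps, best_f = eps, f
        eps = round(eps + step, 10)
    return best_eps, best_f

probs = [[0.62, 0.71, 0.35, 0.46, 0.85], [0.10, 0.20, 0.90, 0.15, 0.30]]
truth = [[1, 1, 0, 0, 1], [0, 0, 1, 0, 0]]
print(grid_search(probs, truth))
```

Because every grid point is evaluated, the returned threshold is globally optimal over the grid, which is the property the text contrasts with GA/PSO.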

Evaluation.
The traditional statistical measure of classification accuracy considers only exact matching of the decision vector y against the true vector t. This evaluation is, however, unsuitable for simultaneous-fault diagnosis, where partial matching is preferred. Therefore, a well-known and common evaluation method called the F-measure [32-34] is employed. The F-measure is mostly used as a performance evaluation for information retrieval systems, where a document may belong to a single tag or multiple tags simultaneously, which is very similar to our current study. By using the F-measure, both single-fault and simultaneous-fault test cases can be evaluated fairly. The definition of the F-measure is given in (10).
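A small example of why partial matching matters: under exact matching, a half-correct prediction scores zero, while the per-case F-measure gives partial credit. This sketch assumes the standard multilabel F1 definition, F1 = 2|y ∩ t| / (|y| + |t|).

```python
# Exact matching vs F-measure on a partially correct prediction.

def f_measure(y_pred, y_true):
    """Per-case multilabel F1 = 2|y ∩ t| / (|y| + |t|)."""
    inter = sum(a & b for a, b in zip(y_pred, y_true))
    total = sum(y_pred) + sum(y_true)
    return 2.0 * inter / total if total else 1.0

t = [1, 0, 1, 0, 0]        # true simultaneous fault: labels 1 and 3
y = [1, 0, 0, 0, 0]        # classifier recovered only label 1
print(y == t, f_measure(y, t))   # exact match fails; F-measure gives 2/3
```

A diagnosis that finds one of two co-occurring faults is still useful to the engineer, and the F-measure reflects that.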

Summary of the Proposed Framework.

The proposed framework and techniques are summarized in Figure 4. The trained classifier P_class produces the probability vector P = [P_1, P_2, ..., P_i, ..., P_N] for each case in the validation set D_Pro-Vali(F_1 ∪ F_S). To optimize the parameters of the classifier and the decision threshold, the fitness of each candidate parameter is obtained and evaluated based on the F-measure F_me over D_Pro-Vali(F_1 ∪ F_S) under a 5-fold cross-validation strategy (Figure 4(c)). Finally, the test dataset D_Pro-Test(F_1 ∪ F_S) is used to evaluate the performance of the proposed framework with the optimal classifier parameter and decision threshold obtained.

Experimental Setup and Data Preparation for a Case Study
To obtain representative sample data for model construction and to verify the effectiveness of the proposed framework, experiments were carried out. The details of the experiments are discussed in the following subsections, followed by the corresponding results and comparisons. All the proposed methods were implemented in Matlab R2008a and executed on a PC with a Core 2 Duo E6750 @ 2.13 GHz and 4 GB RAM.

Test Rig and Sample Data Acquisition.
The experiments were performed on a test rig, shown in Figure 5, which can simulate the GTGS of the Macau power plant. The test rig includes a computer for data acquisition, an electric load simulator, a prime mover, a gearbox, a flywheel, and an asynchronous generator. As it is not realistic to implement the diagnostic system to monitor all the components of the GTGS in one study, this research selected fault detection of the gearbox as a case study. The test rig can simulate many common faults in the gearbox of the GTGS, such as unbalance, misalignment, and gear crack. A total of 13 cases, including one normal case, 8 single faults, and 4 simultaneous faults in the gearbox, were simulated on the test rig in order to generate sample training and test datasets. Some samples of single-fault and simultaneous-fault patterns are shown in Figures 6 and 7, respectively. Figures 6 and 7 show that the signal profiles of single faults and simultaneous faults are very similar, which makes them difficult to distinguish manually, but their degrees of similarity can be detected using the proposed framework.
Table 1 shows the detailed descriptions of the thirteen simulated cases, in which mechanical misalignment of the gearbox was simulated by adjusting one side of the gearbox height with shims, and mechanical unbalance was simulated by adding an eccentric mass on the output shaft. The sample vibration data were acquired by two triaxial accelerometers mounted on the outer case of the gearbox, as shown in Figure 5. The accelerometers record the gearbox vibration signals along the horizontal and vertical directions, respectively. The vibration signal in the axial direction is ignored, since the test rig uses spur gears, for which vibration along the axial direction is not significant. To construct and test the diagnostic framework, each simulated single fault was repeated 200 times, and each simultaneous fault 100 times, under various random electric loads. Each time, 2 seconds of vibration data were recorded at a sampling rate of 2048 Hz.

Feature Extraction by WPT and TDSF.
Feature extraction is the determination of a feature vector from a signal with minimal loss of important information. A feature vector is usually a reduced-dimensional representation of that signal, reducing the modeling complexity and computational cost. Through WPT, a set of 2^L subbands of a signal can be obtained, where L is the level of WPT decomposition.
With reference to the literature [8], statistical characteristics can effectively represent signal subbands, so that the dimensionality of a feature vector extracted from vibration signals can be reduced. For each signal subband, the statistical characteristics include the following: (1) maximum of the wavelet coefficients, (2) minimum of the wavelet coefficients, (3) mean of the wavelet coefficients, and (4) standard deviation of the wavelet coefficients.
In this case study, for a vibration signal of 4096 sample points, there are 4096 wavelet coefficients and 2^4 = 16 subbands after L = 4 levels of WPT decomposition. Using the above statistics, there are only 2^4 × 4 = 64 features, which greatly reduces the input complexity (from 4096 inputs to 64 inputs) for the next stage. For conciseness, the term WPT hereafter refers to the process of wavelet packet decomposition together with the calculation of the above statistics. After decomposition by WPT, time-domain statistical methods are usually further employed to extract time-domain features of the raw signals, which capture the physical characteristics of the time series data. For instance, references [12, 13] applied time-domain statistical features, such as mean, standard deviation, skewness, crest factor, and kurtosis, for fault detection on gear trains and low-speed bearings, respectively. In this study, 10 statistical time-domain features are employed to analyze the vibration signal; Table 3 presents these features. After feature extraction by WPT and TDSF, the number of extracted features is shown in Table 4.
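The sub-band statistics and a subset of the time-domain features can be sketched as below. Only five of the paper's ten TDSF are shown, and the dummy sub-bands stand in for real WPT output.

```python
# Sketch of the feature extraction arithmetic above: four statistics
# (max, min, mean, std) per sub-band give 16 x 4 = 64 WPT features, plus
# time-domain statistical features (TDSF) computed on the raw signal.
import math

def band_stats(band):
    """Max, min, mean, and standard deviation of one sub-band."""
    m = sum(band) / len(band)
    sd = math.sqrt(sum((c - m) ** 2 for c in band) / len(band))
    return [max(band), min(band), m, sd]

def tdsf(x):
    """Five of the ten time-domain features: mean, std, RMS, kurtosis, crest."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    rms = math.sqrt(sum(v * v for v in x) / n)
    kurtosis = sum((v - mean) ** 4 for v in x) / (n * sd ** 4)
    crest = max(abs(v) for v in x) / rms
    return [mean, sd, rms, kurtosis, crest]

subbands = [[float(i + b) for i in range(8)] for b in range(16)]  # dummy WPT output
features = [s for band in subbands for s in band_stats(band)]
print(len(features))                       # 16 sub-bands x 4 stats = 64
print(len(tdsf([math.sin(0.1 * i) for i in range(100)])))   # 5 TDSF values
```

The same 64 + TDSF feature vector is what KPCA subsequently compresses to 77 principal components in the case study.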

Dimension Reduction by KPCA.
Although useful features can be extracted by WPT and TDSF, the dimension of the extracted features is still high, which can degrade diagnostic performance. To tackle this issue, KPCA is applied to obtain a small set of principal components of the extracted features. Using the eigenvalues obtained from KPCA, the unimportant transformed features can be deleted. Therefore, only a limited number of principal components are necessary, while 95% of the information in the features is retained.

Normalization.
To ensure that all features contribute evenly, all reduced features go through normalization to the interval [−1, 1]. Each extracted feature is normalized by the following formula:

z = 2(z_KPCA − z_min)/(z_max − z_min) − 1,

where z_KPCA is an output feature after KPCA, z_min and z_max are the minimum and maximum values of that feature over the dataset, and z is the normalized result. After normalization, a processed dataset D_Proc is obtained. The pairwise-coupled probabilistic classification algorithm can then be employed to construct the fault classifier based on D_Proc-Train.
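A minimal sketch of the [−1, 1] min-max scaling, assuming the standard formula z = 2(z_KPCA − z_min)/(z_max − z_min) − 1:

```python
# Min-max normalization of one feature column to [-1, 1].

def normalize(col):
    lo, hi = min(col), max(col)
    if hi == lo:
        return [0.0 for _ in col]   # constant feature: map to the midpoint
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in col]

print(normalize([2.0, 4.0, 6.0]))   # -> [-1.0, 0.0, 1.0]
```

In practice z_min and z_max would be taken from the training set and reused for the validation and test sets, so that all three are scaled consistently.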

Experimental Results and Discussion
To verify the effectiveness of the proposed framework, a set of experiments using different combinations of methods was carried out on the validation and test datasets. Performance evaluation for all experiments was based on the F-measure. First, a set of experiments was carried out to determine the best combination of the system configuration. In the WPT phase, the mother wavelet and the level of decomposition L were selected by trial and error. Among the family of mother wavelets, the Daubechies wavelet (Db) is the most popular and was hence employed in the experiments. In this case study, three Daubechies wavelets, Db3, Db4, and Db5, were tried, and the range of L was set from 3 to 5. Moreover, three different kernel functions for KPCA, namely linear, radial basis function (RBF), and polynomial, were tested. Different kernel functions have different hyperparameters for adjustment; however, it is very time-consuming to try all possible hyperparameter values. To reduce the number of trials, the hyperparameter σ of the RBF kernel was set to 2^v for v ranging from −3 to +3, and the degree d of the polynomial kernel was taken from 2 to 5. Moreover, the common width for RVM and PCRVM, the common spread for PNN and PCPNN, and the decision threshold were assumed to be 1, 1, and 0.5, respectively, during the determination of the best configuration of the feature extraction module. Under this configuration, the best combination of kernel function and parameter for feature extraction (WPT + TDSF + KPCA) is presented in Table 5.
According to experimental results not listed here, the classifiers using KPCA with the polynomial kernel of degree d = 4 and the mother wavelet Db4 at level 4 reach the highest diagnostic accuracies; hence, this combination of feature extraction techniques was finally selected. Table 5 also indicates that 77 principal components are obtained with this combination. In other words, a raw signal of 16384 data points can be transformed into 77 features as the input variables of the classifiers.
After determining the configuration of the feature extraction module, the next step is to determine the optimal parameters of the classifiers using grid search. As mentioned previously, different probabilistic classifiers have their own hyperparameters for tuning: PNN/PCPNN uses the spread s, and RVM/PCRVM employs the width w. In this case study, the value of s was examined from 1 to 3 at an interval of 0.5, and w was selected from 1 to 8 at an interval of 0.5. To find the optimal decision threshold, the search region was set from 0 to 1 at an interval of 0.01. Then 5-fold cross-validation was applied to D_Proc-Vali to determine the best combination of parameters. Finally, the best hyperparameters and thresholds for PCRVM (w, ε) and PCPNN (s, ε) were found to be (6.5, 0.67) and (1, 0.76), respectively. To verify the effectiveness of the pairwise coupling strategy, a set of experiments without pairwise coupling was carried out using the one-versus-all strategy; the experimental results are shown in Table 6 (experiments 1 to 4). Compared with the results of experiments 5 to 8, where the pairwise coupling strategy is employed, the accuracies in the experiments without pairwise coupling are generally 2% to 5% worse. The main reason is that only N binary classifiers are constructed for N labels in the one-versus-all strategy, so there are many indecision regions between pairs of classes; when a test case lies in these regions, the classifiers mostly fail to classify the faults correctly. The classifiers with the pairwise coupling strategy (i.e., PCPNN and PCRVM) can minimize those indecision regions. Hence, a probabilistic classifier with the pairwise coupling strategy is an effective approach to improve diagnostic accuracy. In this case study, the proposed framework in experiment 8 achieves the best F-measure of 0.9129. Note that the proposed framework employs only a training set of single-fault patterns to construct the classifier, while the overall
performance is evaluated over both single-fault and simultaneous-fault test patterns. Therefore, the proposed method can successfully detect simultaneous faults without costly simultaneous-fault training patterns. The effectiveness of the proposed framework is further evaluated by looking at the size of the training dataset. In the training of the proposed diagnostic model, only single-fault patterns are processed. From Table 2, there are 160 simultaneous-fault patterns in D_Train or D_Proc-Train, while there are in total 1240 (i.e., 1080 + 160) patterns of both single faults and simultaneous faults. When the 160 simultaneous-fault patterns are not required, a significant reduction of training patterns of 12.9% (i.e., (160/1240) × 100%) is achieved. In fact, the saving is even more significant than this percentage suggests because simultaneous-fault signals are more costly and difficult to acquire.
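The two-dimensional grid search over the classifier hyperparameter and the decision threshold described above can be sketched as follows. Here `evaluate()` is a hypothetical stand-in for training PCRVM at a given width and scoring the 5-fold cross-validated F-measure on D_Proc-Vali at a given threshold; a toy quadratic surface peaking at the values reported in the text, (6.5, 0.67), is used so the sketch is self-contained.

```python
import numpy as np

# Search grids matching the ranges quoted in the text.
widths = np.arange(1.0, 8.01, 0.5)        # width w from 1 to 8, step 0.5
thresholds = np.arange(0.0, 1.001, 0.01)  # threshold from 0 to 1, step 0.01

def evaluate(width, threshold):
    # Placeholder for 5-fold cross-validated F-measure of PCRVM on
    # D_Proc-Vali; this toy surface is maximized at (6.5, 0.67).
    return -(width - 6.5) ** 2 - (threshold - 0.67) ** 2

# Exhaustively score every (width, threshold) pair and keep the best one;
# unlike gradient-based or heuristic optimizers, this cannot miss the
# global optimum on the grid.
best_w, best_t = max(
    ((w, t) for w in widths for t in thresholds),
    key=lambda wt: evaluate(*wt),
)
```

Because every grid point is scored, the search is guaranteed to return the global optimum over the grid, which is the property the paper cites in favor of grid search over heuristic optimizers.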
In order to further verify the superiority of the proposed framework, the latest single-label binary classification framework based on SVM [28] was also applied to the same datasets for comparison. The evaluation results are shown in Table 7, in which the overall diagnostic accuracy of the single-label method is lower than that of the proposed PCRVM framework. Therefore, among the compared methods, the proposed framework is currently the best for single-fault and simultaneous-fault diagnosis of the GTGS.

Conclusions
In this paper, simultaneous-fault diagnosis for the GTGS is studied. A systematic framework combining feature extraction, pairwise-coupled probabilistic classification, and parameter optimization based on a partial-match assessment has been developed to overcome the challenges of simultaneous-fault diagnosis.

Nomenclature:
D_1: dataset of single faults
D_S: dataset of simultaneous faults
F_1: feature matrix of single faults
F_S: feature matrix of simultaneous faults
p: probability vector
y: decision vector
GS: grid search
s: spread of PNN
w: width of RVM
s_opt: optimal spread of PNN
w_opt: optimal width of RVM
Optimal threshold: decision threshold selected by grid search
D_KPCA-Train, D_KPCA-Vali, D_KPCA-Test: feature datasets after KPCA

Figure 1: Proposed simultaneous-fault diagnostic framework and evaluation approach for GTGS.
(a) For the current fixed values of α, the most probable weights w_MP are found, which is the location of the posterior mode. Since p(w | t, F, α) ∝ p(t | F, w) p(w | α), this step is equivalent to the following maximization:
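In the standard relevance vector machine formulation (Tipping's notation, with targets t, design matrix F, weights w, and hyperparameters α), this MAP step is usually written as below; this is a sketch of the standard form and the paper's own equation may differ in notation:

```latex
\mathbf{w}_{\mathrm{MP}}
  = \arg\max_{\mathbf{w}}
    \left\{ \log p(\mathbf{t} \mid \mathbf{F}, \mathbf{w})
          + \log p(\mathbf{w} \mid \boldsymbol{\alpha}) \right\}
```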
For each of the d labels, label j is associated with d − 1 pairwise classifiers C_jk, k ≠ j. Since C_jk and C_kj are complementary, only d(d − 1)/2 distinct pairwise classifiers need to be trained in total.
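The pairwise outputs must then be coupled into a single posterior over the d classes. The paper's exact coupling algorithm is not reproduced in this excerpt, so the sketch below uses the simple closed-form coupling rule of Price et al. purely as an illustration; the matrix `r` of pairwise probabilities is hypothetical toy data.

```python
import numpy as np

def couple_pairwise(r):
    """Combine pairwise probabilities into class posteriors.

    r: (d, d) matrix where r[j, k] approximates P(class j | class j or k),
    so r[j, k] + r[k, j] == 1 off the diagonal.  Uses the closed-form rule
    p_j = 1 / (sum_k 1/r_jk - (d - 2)) of Price et al., then normalizes.
    """
    d = r.shape[0]
    p = np.empty(d)
    for j in range(d):
        others = [k for k in range(d) if k != j]
        p[j] = 1.0 / (np.sum(1.0 / r[j, others]) - (d - 2))
    return p / p.sum()

# Toy example with d = 3 classes: class 0 wins every pairwise contest.
r = np.array([[0.0, 0.8, 0.9],
              [0.2, 0.0, 0.6],
              [0.1, 0.4, 0.0]])
probs = couple_pairwise(r)
```

Each pairwise classifier only ever compares two classes, so coupling like this removes the indecision regions that arise when d one-versus-all classifiers disagree.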

Figure 4(a) shows the workflow of combining the technologies of WPT, TDSF, and KPCA as the feature extraction submodule. Every dataset for training, validation, and testing is required to go through the feature extraction process. Figure 4(b) shows the architecture of the classifier, where the pairwise coupling is deployed as depicted in Figure 3. The classifier is then passed to a grid search optimizer to search for the optimal parameter of the classifier (i.e., the width w for PCRVM/RVM or the spread s for PNN/PCPNN) and the decision threshold. Figure 4(d) evaluates the probability vectors [p_1, p_2, ..., p_j, ..., p_d] produced for D_Proc-Vali (F_1 ∪ F_S) and D_Proc-Test (F_1 ∪ F_S) in terms of the F-measure.

Figure 4: Detailed workflow of the proposed framework.

Figure 5: Fault simulator for gas turbine generator system.

Figure 6: Sample normalized single-fault and normal patterns of GTGS.
The training feature set, D_WT-Train, the validation feature set, D_WT-Vali, and the unseen vibration signal features for testing, D_WT-Test, are generated. Considering the existence of irrelevant and redundant information in the extracted features, KPCA is then applied to remove useless information and further reduce the dimensions of D_WT-Train, D_WT-Vali, and D_WT-Test. The results are saved as D_KPCA-Train, D_KPCA-Vali, and D_KPCA-Test, respectively. In order to ensure that all features contribute evenly, every feature in D_KPCA-Train, D_KPCA-Vali, and D_KPCA-Test is normalized within [−1, 1]. The processed training dataset, validation dataset, and unseen signal for testing are named D_Proc-Train, D_Proc-Vali, and D_Proc-Test, respectively.
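The per-feature [−1, 1] scaling above is straightforward, but one detail matters: the minimum and maximum must be fitted on the training features only and then reused for the validation and test sets, so all three datasets see the same transformation. A minimal sketch with a hypothetical toy matrix:

```python
import numpy as np

def fit_minmax(train):
    # Per-feature limits x_min and x_max, computed on training data only.
    return train.min(axis=0), train.max(axis=0)

def scale(x, lo, hi):
    # Map each feature linearly so that [lo, hi] -> [-1, 1].
    return 2.0 * (x - lo) / (hi - lo) - 1.0

train = np.array([[0.0, 10.0],
                  [5.0, 30.0],
                  [10.0, 50.0]])
lo, hi = fit_minmax(train)
scaled = scale(train, lo, hi)          # training features now span [-1, 1]
# Validation/test features reuse the SAME lo/hi fitted on training data.
```

Reusing the training limits means unseen features may fall slightly outside [−1, 1], which is expected and harmless for the classifier.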

Table 1: Sample single-faults and possible simultaneous-faults in the gearbox of GTGS.

Table 2: Division of the sample dataset into different subsets.
In total there are 1800 single-fault sample data (i.e., (1 normal case + 8 kinds of single faults) × 200 samples) and 400 simultaneous-fault sample data (i.e., 4 kinds of simultaneous faults × 100 samples). In order to test the diagnostic performance for both single faults and simultaneous faults, the sample data was divided into different subsets as shown in Table 2, where D_Valid-1 denotes the validation set of 360 single-fault patterns without feature extraction, and D_Valid-S denotes the validation set of the extracted features of 120 simultaneous-fault patterns.
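The dataset bookkeeping quoted above and the training-set reduction discussed earlier can be verified directly; the class and sample counts below are taken from the text, and the 1080/160 training split follows Table 2 as described.

```python
# Sample counts stated in the text.
single_total = (1 + 8) * 200          # normal + 8 single-fault classes, 200 samples each
simultaneous_total = 4 * 100          # 4 simultaneous-fault classes, 100 samples each

# Training split per Table 2: 1080 single-fault + 160 simultaneous-fault patterns.
train_single, train_simultaneous = 1080, 160
reduction_pct = 100 * train_simultaneous / (train_single + train_simultaneous)
```

Dropping the 160 simultaneous-fault training patterns removes 160/1240 ≈ 12.9% of the training data, matching the figure quoted in the results.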

Table 3: Definition of common statistical features in the time domain. Note: x_i represents a signal series for i = 1, 2, ..., N, where N is the number of data points of a raw signal.
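Table 3 defines the paper's exact time-domain statistical feature (TDSF) set, which is not fully reproduced in this excerpt; the sketch below computes a representative subset of such features (mean, standard deviation, RMS, peak, skewness, kurtosis, crest factor) for a signal x_i, i = 1, ..., N, and should be read as an illustration rather than the paper's precise definitions.

```python
import numpy as np

def time_domain_features(x):
    """Representative time-domain statistical features of a 1-D signal."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = x.std()                     # population standard deviation
    rms = np.sqrt(np.mean(x ** 2))      # root mean square
    peak = np.max(np.abs(x))
    return {
        "mean": mu,
        "std": x.std(ddof=1),           # sample standard deviation
        "rms": rms,
        "peak": peak,
        "skewness": np.mean((x - mu) ** 3) / sigma ** 3,
        "kurtosis": np.mean((x - mu) ** 4) / sigma ** 4,
        "crest_factor": peak / rms,     # sensitive to impulsive gear faults
    }

# Sanity check on a pure sine: RMS ~ 1/sqrt(2), crest factor ~ sqrt(2).
feats = time_domain_features(np.sin(np.linspace(0, 2 * np.pi, 1024)))
```

Features such as kurtosis and crest factor are popular for vibration signals because impulsive component faults inflate them well before the overall energy changes.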

Table 4: Features extracted under different decomposition levels of WPT.

To evaluate the diagnostic performance for both single faults and simultaneous faults, various combinations of training and test datasets are employed. For example, experiments 1, 3, 5, and 7 have no feature extraction, so the training dataset only uses the raw dataset D_Train (D_1 ∪ D_S), whereas the feature-extracted dataset D_Proc-Train (F_1 ∪ F_S) was selected as training data for experiments 2, 4, 6, and 8. As described in Section 4.1, the best decision threshold for both PNN and PCPNN is 0.76, while 0.67 is best for both RVM and PCRVM. Table 6 reveals that the experiments with feature extraction (2, 4, 6, and 8) show an increase of diagnostic accuracy by 4%-8% as compared with the experiments without feature extraction (1, 3, 5, and 7). These results indicate that the proposed feature extraction method (WPT + TDSF + KPCA) is effective.

Table 5: Accuracies of various classifiers under the best combination of polynomial kernel of KPCA and mother wavelet of Db4/L4.

Table 6: Evaluation of different combinations of techniques using the best model parameters obtained.

Table 7: Comparison of SVM, PCPNN, and PCRVM.

In the proposed framework, the feature extraction module is designed by combining the techniques of WPT + TDSF + KPCA to effectively capture the single-fault components in the simultaneous-fault vibration patterns. A pairwise coupling strategy is also employed to deal with the interaction between the independent labels, which outperforms the approaches without pairwise coupling by 2% to 5% in simultaneous-fault diagnosis. In short, this research work makes contributions in four aspects: (1) it is the first research to show that the features of single-fault signal patterns of the GTGS can be found in their simultaneous-fault patterns by the proposed feature extraction method; (2) a high diagnostic accuracy for both unseen single-fault and simultaneous-fault patterns is achieved by the proposed framework; (3) the proposed framework achieves a high diagnostic efficiency while a large amount of expensive simultaneous-fault training data is no longer necessary; and (4) it is the original application of the proposed framework to the problem of GTGS diagnosis. Since the proposed framework for simultaneous-fault diagnosis is general, it could be applied to other similar industrial problems.

Nomenclature (continued):
F_Proc: feature matrix after feature extraction
F_Proc-Train: training feature matrix after feature extraction
F_Proc-Vali: validation feature matrix after feature extraction
F_Proc-Test: test feature matrix after feature extraction
x_i: i-th raw signal data
x_max: upper limit of a feature
x_min: lower limit of a feature
F_WT: feature set extracted by WPT and TDSF
y: predicted decision
φ_j(F_WT): transformed variables φ_j for vector F_WT