Power transformer fault diagnosis considering data imbalance and data set fusion

Department of Building Service Engineering, Hong Kong Polytechnic University, Hong Kong, China
Academy for Advanced Interdisciplinary Studies and Department of Electrical and Electronic Engineering, Southern University of Science and Technology, Shenzhen, China
Research Institute, State Grid Zhejiang Electric Power Co., Ltd, Hangzhou, China
Zhejiang Huayun Information Technology Co., Ltd, Hangzhou, China
State Grid Zhejiang Electric Power Co., Ltd, Lishui Power Supply Bureau, Lishui, China
Guangdong Provincial Key Laboratory of Brain-inspired Intelligent Computation, Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China

With the development of sensor technology, DGA methods have been shifting from offline to online monitoring. Compared with manually collecting and testing transformer oil periodically, an online DGA monitor measures gas concentrations in real time and reports the condition of the transformer promptly. Thus, the reliability of the transformer can be improved while manual maintenance is reduced. However, online monitoring also brings new issues. Frequent nuisance warnings or false alarms caused by low diagnosis accuracy place a new burden on the industry: frequently reported faults lead to extra maintenance costs, while mistaking a fault for a healthy condition can lead to severe accidents. Therefore, a diagnostic tool with high accuracy is necessary to implement efficient online DGA of power transformers.
Over years of development, DGA methods have evolved into two mainstreams: interpretation methods [9,10] and artificial intelligence (AI) based methods [11]. Interpretation methods include the IEC 60599 method [12], the Key Gas method [13], the Duval Triangular method [14], the Doernenburg method [15] and the Rogers method [16]. They identify different types of faults on the basis of gas ratios. These methods were developed empirically, and no mathematical formulation can be used. According to field experience, the diagnosis accuracy of interpretation methods is limited, and they cannot provide an interpretation for every combination of gas ratios [17]. They may even provide different diagnosis results for the same transformer condition [18].
To address these issues, various AI techniques based on dissolved gas concentrations have been developed, such as fuzzy logic [19], artificial neural networks [20], support vector machines [21], self-organizing maps [22] and others [23][24][25]. In contrast to interpretation methods, AI methods learn the underlying relationships between gas concentrations and transformer faults from quantitative evidence. They can provide a diagnosis with higher accuracy and are widely adopted in practical operation. However, AI based methods still have limitations in transformer diagnosis.

| Shortage of high-quality data
AI tools are data-driven: they require large amounts of samples to cover the full range of expected variation and null cases. The performance of AI methods is highly dependent on the quality of the data set. However, a transformer fault is an event with relatively small probability, and it is hard for a utility to collect enough fault samples from its own transformers to perform fault diagnosis. Plenty of reported works use public data sets [14] and data published in the literature to train the AI model, consequently resulting in a model with poor generalization ability. Nowadays, with the wide application of online monitoring technology, data availability grows rapidly. However, nearly all of these data correspond to the healthy condition of the transformer, so fault data are still lacking.

| Lack of investigations on data set fusion
Data set fusion is an efficient way to relieve this data deficiency. As mentioned above, DGA data can be collected from many sources, such as public data sets, the literature and private companies. Combining data sets from different sources brings diversity to the training of AI algorithms and helps them achieve better generality. However, such a data set may also bring difficulties, because the characteristics and distributions of data from different sources may be inconsistent and discrepant [26]. As a result, the capability of AI algorithms on fused data is questionable.

| Lack of discussion on imbalanced data sets
As mentioned previously, health data accumulate rapidly at the utility with increasing operating time, whereas fault data remain rare. This leads to an imbalanced data set whose imbalance ratio between health and fault samples keeps increasing. In this scenario, AI algorithms perform poorly because they are generally designed for balanced data sets. To avoid this problem, previous studies usually train AI algorithms on a balanced data set or one with a low imbalance ratio [26,27]. However, because many health samples are abandoned, useful information for fault diagnosis is lost. In theory, if the collected health data were sufficiently large to include all gas combinations of the healthy state, the diagnosis could be completely correct. Therefore, we approach the problem from the converse direction: we utilize the large number of collected health samples to improve the accuracy of diagnosis.
To address these limitations, this article presents a comprehensive investigation from the view of both the data set and the methodology. The first contribution of this article is to show, for the first time, the influence of health samples on the performance of transformer diagnosis. Previous studies mainly focus on fault samples: they train a classifier using a data set that contains a large number of fault samples, which is not the case for most utilities. The data set selected in this article is obtained from more than 100 transformers and is therefore very practical. The other contribution is an efficient algorithm for imbalanced transformer diagnosis. As a huge number of health samples are involved, the data set is highly imbalanced, and traditional imbalanced classification algorithms are insufficient in this situation. Meanwhile, this article also provides a thorough discussion of the data characteristics and the reasons for the performance differences among various algorithms. According to the experimental results, the above limitations are resolved by the proposed method, which helps power companies improve diagnosis accuracy from an entirely different perspective. Since this method does not depend excessively on fault data, it is feasible and practical.
It should be noted that after the condition of the power transformer is determined, various AI algorithms [28] can easily further identify the type of fault, since the data set is then relatively balanced. Classification of this balanced data has already been discussed elaborately in the literature; it is not the focus of this article and thus will not be discussed.
The remainder of this article is structured as follows. Section 2 describes the data set fusion in this article and the data pre-processing techniques for classification; the data characteristics are also discussed in detail. In the following section, AI methods dealing with imbalanced data are briefly introduced. In Section 4, a novel algorithm for highly imbalanced data is presented. In Section 5, experiments are carried out using different amounts of samples and different algorithms, and the performances of oversampling algorithms and ensemble algorithms are compared and discussed in detail. Finally, the conclusion is drawn in Section 6.

| Introduction of data set
A fused dissolved gas data set is constructed in this article. Five different data sets are mixed together and used to establish the fault diagnosis model and test its performance: the public IEC TC10 data set [14], data collected by the Ministry of Electricity and Energy of Egypt [10], data published in [11], and two private data sets provided by North China Grid and Zhejiang Grid. In the two private data sets, the data were collected from more than 100 power transformers of different manufacturers at different voltage levels. The combined data set consists of 9595 health samples (gathered from transformers in healthy status) and 993 fault samples. The details of these data sets are listed in Table 1.

| Data characteristic analysis
DGA data typically present highly skewed distributions. As the variation range of DGA data presented in Table 2 shows, a dissolved gas concentration can be as low as zero ppm (parts per million) but can also reach tens of thousands of ppm. Such extreme variation can be a source of numerical imprecision and overflow in AI algorithms [29].
Data from different companies might present different characteristics. To reveal the distribution variation between data sets, histograms of gas concentrations from the IEC data set and the Zhejiang data set are displayed in Figure 1. Because the skewed distribution is difficult to visualise, a logarithmic transformation is applied to the data. The concentrations of C2H6 show opposite distribution centres for the two data sets: in the IEC data set the distribution centre is located near five for the health samples, whereas in the Zhejiang data set this centre corresponds to the fault samples. This raises a generalization issue when training on different data sets. Meanwhile, although some fault samples are far from the data clusters, they cannot be regarded as noise.

| Pre-processing DGA data
As shown by the data characteristics presented in the previous section, raw DGA data have sharp variations among the concentrations of different dissolved gases. Using data of different scales causes issues in the accuracy and convergence of the AI model. To ensure accurate, efficient and meaningful analysis, raw DGA data must first be transformed onto the same scale.
Data transformation methods, such as normalization and logarithm transformation [11], have been applied to rescale DGA data. In this article, the base-10 logarithm transformation is adopted. It not only rescales the data so that extreme values are avoided, for example, log10(92,600) = 4.97, but also transforms gas ratios into linear relations, for example, log10(CH4/H2) = log10(CH4) − log10(H2). This avoids overflow in AI algorithms and simplifies the relations among different gases. To avoid the infinity caused by the logarithm of zero, zero is replaced with a negligible quantity of 0.001. Finally, the data are shifted to the positive range by subtracting the minimum, resulting in

x_new = log10(x) − log10(0.001)    (1)

where x and x_new are the values of a gas concentration before and after transformation.
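Equation (1) can be sketched in a few lines of Python; this is a minimal illustration using NumPy (the function name is ours, not from the article):

```python
import numpy as np

def rescale_dga(x, floor=0.001):
    """Log-transform DGA concentrations (ppm) and shift to positive.

    Implements Equation (1): x_new = log10(x) - log10(0.001).
    Zeros are replaced with a small floor before the logarithm.
    """
    x = np.asarray(x, dtype=float)
    x = np.where(x <= 0.0, floor, x)          # avoid log10(0) = -inf
    return np.log10(x) - np.log10(floor)      # shift so all values are >= 0

# Example: zero ppm maps to 0, and an extreme 92,600 ppm maps to about 7.97
sample = rescale_dga([0.0, 1.0, 92600.0])
```

Because the shift is a constant, gas ratios remain simple differences after the transform, as noted above.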

| IMBALANCED CLASSIFICATION ALGORITHMS
The fault detection procedure is essentially a pattern recognition process. AI algorithms attempt to learn the inherent correlations in the data and can provide an accurate and reliable transformer diagnosis. However, most AI algorithms are developed for balanced data sets. They perform poorly on imbalanced data sets, where the separating boundary can be shifted toward the minority class. This shift generates more false negative predictions, which lowers the model's performance on the minority positive class.
To deal with the imbalance, three categories of methods [30,31] have been developed, namely resampling methods, cost-sensitive methods and ensemble methods.
Resampling methods [32,33] are designed to reduce between-class imbalance by either undersampling or oversampling. Distance thresholding or clustering is the essential criterion in resampling methods for deciding which samples to add or remove. However, undersampling may discard instances that contain potentially useful information, and it reduces the total number of training instances. On the other hand, oversampling the minority instances may lead to overfitting, and it also suffers from a large computational cost.
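To make the oversampling idea concrete, the following is a minimal, illustrative sketch of the interpolation step behind SMOTE-style methods: each synthetic minority sample is placed on the line segment between an existing minority sample and one of its k nearest minority neighbours. This is a simplified sketch, not the full SMOTE algorithm or any of its variants:

```python
import numpy as np

def smote_like_oversample(minority, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    each chosen sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    # pairwise distances within the minority class
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # a sample is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        j = neighbours[i, rng.integers(k)]
        gap = rng.random()                    # position along the segment
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)
```

The sketch also makes the weakness discussed later in this article visible: every synthetic sample depends entirely on the neighbour relationship among the few minority samples, so when the minority class is overlapped or noisy, the interpolated samples inherit that noise.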
Compared with resampling, cost-sensitive methods [34,35] modify existing balanced AI algorithms to change their preference for the majority classes. They assign a higher cost to the misclassification of the minority class during the training process, so more emphasis is put on the generalization of the minority class. A cost-sensitive classification technique takes a cost matrix into consideration during model building; the cost matrix encodes the penalty of classifying a sample of one class as another. However, the cost matrix for a specific task must be given by a domain expert beforehand, which is usually unavailable in many real-world problems.
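The effect of a cost matrix can be illustrated at prediction time: instead of picking the most probable class, a cost-sensitive decision picks the class with the lowest expected cost. The 20x penalty below is a hypothetical value chosen for illustration, not a figure from the article:

```python
import numpy as np

# Hypothetical cost matrix: cost[i][j] = penalty of predicting class j
# when the true class is i (0 = health/majority, 1 = fault/minority).
# Missing a fault (row 1, column 0) is penalized 20x more than a false alarm.
cost = np.array([[0.0, 1.0],
                 [20.0, 0.0]])

def cost_sensitive_predict(p_fault, cost):
    """Pick the class with the lowest expected cost instead of the
    highest probability."""
    p = np.array([1.0 - p_fault, p_fault])    # [P(health), P(fault)]
    expected = p @ cost                       # expected cost of each prediction
    return int(np.argmin(expected))

# A 10% fault probability is below the usual 0.5 threshold, but the
# asymmetric costs still push the decision towards 'fault'.
```

This is why cost-sensitive classifiers put more emphasis on the minority class: the decision boundary is shifted by the cost ratio rather than by resampling the data.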
Ensemble-based classifiers [36] are constructed from multiple classifiers and try to improve the generalization ability of classification by combining them into a new classifier. The basic idea is to construct several classifiers from the original data and then aggregate their predictions when unknown instances are presented. General ensemble methods include bagging, boosting [37] and negative correlation learning [38]. Ensemble-based classifiers are also integrated with resampling methods, such as SMOTEBoost [39], RUSBoost [40], OOB and UOB [41].
For the above reasons, none of the prevailing methods can perfectly handle imbalanced, large-scale and noisy classification tasks. These methods are sensitive to different factors, and their capacity for transformer diagnosis requires study, especially when data fusion is involved.

| SELF-PACED ENSEMBLE ALGORITHM
The primary reason for the inefficiency of the above imbalanced classification algorithms is the lack of full consideration of the data distribution during training. Though new algorithms are constantly emerging, they seldom consider the characteristics of the minority class distribution and its influence on classification performance. As claimed in Ref. [42], collecting information about the local characteristics of the minority class and distinguishing between safe, borderline, rare and outlier examples is useful to differentiate the performance of basic classifiers.
In this section, we present an ensemble algorithm named Self-Paced Ensemble (SPE) [43]. It considers the distribution of the data based on the concept of 'classification hardness' and iteratively selects the most informative samples according to the hardness distribution.

| Classification hardness
To integrate the data distribution characteristics, the concept of 'classification hardness' is introduced [43]. As the name implies, hardness indicates samples that are hard for classifiers to predict. The classification hardness function H is defined as the overall error, calculated by summing the errors of the individual classifiers. Any loss function can be used as the classification hardness function; the simplest form, the absolute error, is used in this work.
In transformer fault diagnosis, the ensemble method divides the majority class (health data) into many bins and trains a sequence of classifiers; classification is implemented by averaging all these classifiers. Suppose F is the trained ensemble classifier composed of n individual classifiers f_i, and let F(x) denote the classifier's output probability for sample x. Then the hardness of a sample (x, y) with respect to F, using the absolute error, is given by

H(x, y, F) = |F(x) − y|    (2)

where x is the vector of input features and y is the corresponding ground truth. The distribution of classification hardness contains information highly related to the task difficulty, such as outliers. Note that the classification hardness function uses the current model as one of its inputs; intuitively, it gives the difficulty of classifying a specific sample for a specific classifier. By observing the hardness distribution, we can assess the fit of the model on the current data set.
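The absolute-error hardness of Equation (2) is a one-liner; the sketch below (our own helper, with NumPy) shows how an 'easy' and a 'hard' sample differ:

```python
import numpy as np

def hardness(probs, labels):
    """Absolute-error classification hardness H(x, y, F) = |F(x) - y|,
    where probs holds the ensemble's predicted fault probabilities F(x)
    and labels holds the ground truth y in {0, 1}."""
    return np.abs(np.asarray(probs, dtype=float) - np.asarray(labels, dtype=float))

# A confidently correct health sample (F(x) = 0.05, y = 0) is 'easy',
# while a misclassified fault sample (F(x) = 0.2, y = 1) is 'hard'.
h = hardness([0.05, 0.2], [0, 1])
```

Computed over the whole majority set, this vector is what SPE bins and resamples in each iteration.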

| Balancing factor
At the beginning of training, SPE tries to equalize the classification hardness contribution of each bin. However, as the training process evolves, the population of 'simple' samples grows rapidly because the ensemble classifier gradually fits the training set. In this situation, many already-fitted samples would be retained if selection kept the hardness contribution the same, leading to a classifier lacking diversity. Therefore, a balancing factor α is introduced to decrease the sampling probability of those samples along the iterations.
We use the tangent function in Equation (3) to control the growth of the balancing factor α:

α = tan(iπ / (2n))    (3)

where i is the index of the iteration and n is the total number of iterations, which is also equal to the number of classifiers in the ensemble. Thus, in the first iteration (i = 0) we have α = 0, and α → ∞ as i approaches n, as shown in Figure 2.
When α becomes large, we focus more on the harder samples instead of equalizing the hardness contribution. Through this mechanism, SPE gradually focuses on the harder data samples while still keeping knowledge of the easy-sample distribution in order to prevent overfitting.
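The growth schedule of Equation (3) can be sketched directly (a minimal helper of ours):

```python
import math

def balancing_factor(i, n):
    """Equation (3): alpha = tan(i * pi / (2 * n)).
    alpha = 0 in the first iteration (i = 0) and grows without bound
    as i approaches n, shifting the focus towards harder samples."""
    return math.tan(i * math.pi / (2 * n))

# For n = 100 classifiers: alpha is 0 at the start, 1 at the halfway
# point, and grows steeply in the final iterations.
```

The steep late-stage growth is what makes the easy, well-fitted majority samples progressively less likely to be resampled.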

| Self-paced ensemble
Integrating the concepts of classification hardness and the balancing factor, the ensemble algorithm SPE [43] is developed. Similar to boosting algorithms [37], SPE builds classifiers sequentially using undersampling and obtains the final predictor as the summation of all classifiers. The main difference from other ensemble algorithms is the undersampling strategy.
To demonstrate the difference, the boosting algorithm is briefly introduced. Boosting builds classifiers sequentially, as shown in Figure 3. It works by resampling a subset of the majority class and adaptively adjusting the weights on the training instances according to the performance of the previous classifiers: higher weights are assigned to wrongly classified examples. The outputs are then combined using a weighted average, and the final predictor is obtained by combining all classifiers. Boosting can be seriously affected by outliers in the late training period: these outliers are overemphasized and may even disturb the existing classification boundary, which degrades the performance of the model.
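The reweighting step described above can be sketched with a standard AdaBoost-style update (a textbook form, not code from the article); it also makes the outlier problem visible, since a persistently misclassified sample keeps receiving a larger weight in every round:

```python
import numpy as np

def adaboost_weight_update(weights, misclassified, error):
    """One AdaBoost-style reweighting step: wrongly classified samples get
    larger weights so the next classifier concentrates on them.
    `error` is the weighted error rate of the current classifier."""
    alpha = 0.5 * np.log((1.0 - error) / error)        # classifier weight
    w = weights * np.exp(alpha * np.where(misclassified, 1.0, -1.0))
    return w / w.sum()                                 # renormalize to sum to 1

# With uniform weights and one misclassified sample out of four
# (weighted error 0.25), that sample's weight jumps from 0.25 to 0.5.
new_w = adaboost_weight_update(np.full(4, 0.25),
                               np.array([True, False, False, False]), 0.25)
```

An outlier that no base classifier can fit stays in the "misclassified" set every round, so its weight keeps growing, which is exactly the failure mode SPE's hardness-guided sampling is designed to avoid.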
Different from boosting, SPE uses the distribution of classification hardness to resample the subsets, following the framework shown in Figure 4. The initial concept of 'self-paced' learning is to incrementally involve instances in learning, where easy instances are involved first and harder ones are introduced gradually [44]. The majority class (health data) is undersampled into balanced bins by keeping the hardness contribution of each bin the same in the early stage. Thus, the undersampling is guided to select the training samples that contribute the most hardness in the current iteration. The balancing factor α is added as a weight when new bins are updated; it grows along with the training iterations and determines how quickly the importance of easy samples decreases.
The pseudocode of SPE is described in Algorithm I. Given the training set of faults P (minority) and the training set of health samples N (majority), SPE first trains f_0 on P and a randomly selected subset N_0 with |N_0| = |P|, from which the initial hardness is obtained.

Algorithm I: Self-paced ensemble
Input: fault set P, health set N, number of classifiers n, number of bins k
1) i = 0; train classifier f_0 using a random subset N_0 and P, where |N_0| = |P|.
2) For i = 1, ..., n − 1:
3)    Compute the hardness H of each sample in N with the current ensemble.
4)    Update the balancing factor α by Equation (3).
5)    Divide N into k bins according to hardness and derive the bin weights w_l.
6)    Undersample a new subset N_i from the bins according to w_l.
7)    Train f_i using P and the newly undersampled subset N_i.
8) Output: the ensemble classifier F, the average of f_0, ..., f_{n−1}.

After dividing the health data, undersampling is carried out to obtain a new subset N_i for training. The subset N_i is composed of samples randomly selected from each bin; the number of samples selected from the lth bin is determined by a weight w_l, obtained from the hardness and the balancing factor as w_l = 1/(h_l + α), where h_l is the average hardness contribution of the lth bin. Consequently, the classifier f_i of the ith iteration is trained on a balanced data set. To select the samples that are most beneficial for the current ensemble, the hardness values H and the self-paced factor α are updated in each iteration and used to divide the majority set N in the next iteration. After n iterations, all classifiers are trained and their summation F is the final ensemble classifier; the classification is determined by F, equivalently the average of all classifiers f_i. In our work, a decision tree is used as the classifier f_i.
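To make the training loop concrete, the following is a compact, self-contained sketch of SPE, not the authors' implementation: a decision stump stands in for the decision tree base classifier, and the bin weight w_l = 1/(h_l + α) is our reconstruction of the self-paced undersampling rule:

```python
import numpy as np

class Stump:
    """Depth-1 threshold classifier on feature 0, standing in for the
    decision tree used as the base classifier f_i."""
    def fit(self, X, y):
        ts = np.unique(X[:, 0])
        errs = [np.mean((X[:, 0] > t).astype(int) != y) for t in ts]
        self.t = ts[int(np.argmin(errs))]
        return self

    def predict_proba(self, X):
        return (X[:, 0] > self.t).astype(float)   # crude P(fault)

def self_paced_ensemble(P, N, n=10, k=5, seed=0):
    """Sketch of the SPE training loop. P: fault (minority) samples,
    N: health (majority) samples."""
    rng = np.random.default_rng(seed)
    fs = []
    # iteration 0: a random balanced subset of the majority
    sub = N[rng.choice(len(N), size=len(P), replace=False)]
    for i in range(n):
        X = np.vstack([P, sub])
        y = np.r_[np.ones(len(P)), np.zeros(len(sub))]
        fs.append(Stump().fit(X, y))
        if i + 1 == n:
            break
        # hardness of every health sample (y = 0) under the current ensemble
        H = np.abs(np.mean([f.predict_proba(N) for f in fs], axis=0))
        alpha = np.tan((i + 1) * np.pi / (2 * n))          # Equation (3)
        # split the majority into k bins by hardness
        edges = np.quantile(H, np.linspace(0.0, 1.0, k + 1))
        bin_id = np.clip(np.searchsorted(edges, H, side='right') - 1, 0, k - 1)
        bins = [np.where(bin_id == l)[0] for l in range(k)]
        weights = [1.0 / (H[b].mean() + alpha) if len(b) else 0.0 for b in bins]
        total = sum(weights)
        picked = []
        for b, w in zip(bins, weights):
            if not len(b):
                continue
            m = max(1, int(round(len(P) * w / total)))
            picked.append(rng.choice(b, size=m, replace=m > len(b)))
        sub = N[np.concatenate(picked)]
    # final predictor: average of all classifiers, thresholded at 0.5
    return lambda X: np.mean([f.predict_proba(X) for f in fs], axis=0) >= 0.5
```

On well-separated synthetic data the sketch reproduces the intended behaviour: early iterations keep many easy health samples, while later iterations (large α) resample mostly the hard bins.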

| EXPERIMENTAL STUDIES AND ANALYSIS
In this section, experiments using interpretation methods, state-of-the-art imbalanced algorithms and SPE are compared. The test data are randomly selected from the combined data set and consist of 320 health and 260 fault samples. The test data remain the same in all the following experiments.

| Evaluation metrics
Evaluation metrics play a crucial role in assessing classification performance. In previous studies, precision (or accuracy) is the most commonly used metric for transformer diagnosis [11]. However, precision alone is not sufficient for class imbalance problems, since it is highly sensitive to the data distribution. In general, it is difficult to evaluate the performance of imbalanced algorithms using a single metric. In this article, we choose the minority Recall and the G-mean as the evaluation metrics, because they are insensitive to the imbalance and are commonly used for class imbalance problems [41]. The minority Recall shows the performance on the minority class but does not reflect any performance on the other class. The G-mean is an overall metric that reflects how well the performance is balanced between the two classes. The performance of a classification algorithm can be determined by the combination of the two metrics; great classification performance is obtained when both metrics are high.
Assuming the minority class to be the positive class (P) and the majority class to be the negative class (N), classified samples can be separated into four groups, as denoted in the confusion matrix given in Table 3. A confusion matrix contains information about the actual and predicted classifications returned by a classifier.
Recall is defined as the classification accuracy in correctly identifying the positive class, and it can be obtained by

Recall = TP / (TP + FN)

The G-mean is defined as the geometric mean of the recalls over both the positive and negative classes. It is designed to measure the balanced accuracy between the two classes; a low G-mean denotes a classifier that is highly biased toward one single class. It is defined as

G-mean = sqrt( TP/(TP + FN) × TN/(TN + FP) )
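Both metrics follow directly from the four confusion-matrix counts; the helper below (our own sketch) computes them and illustrates why accuracy alone is misleading on imbalanced data:

```python
import math

def recall_and_gmean(tp, fn, tn, fp):
    """Minority Recall and G-mean from confusion-matrix counts
    (positive = fault/minority, negative = health/majority)."""
    recall_pos = tp / (tp + fn)      # recall on the fault class
    recall_neg = tn / (tn + fp)      # recall on the health class
    return recall_pos, math.sqrt(recall_pos * recall_neg)

# A classifier that predicts 'health' for everything has TP = 0, so both
# Recall and G-mean are 0, even though its plain accuracy on a highly
# imbalanced test set can look very high.
```

This is the behaviour the article relies on: a degenerate majority-only classifier cannot hide behind a high accuracy figure.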

| Tests on interpretation method
DGA is the standard method for diagnosing transformer faults based on the gases generated by the decomposition of the transformer insulating oil. IEC 60599 [12] provides a list of DGA methods. Five typical gases are used as the diagnostic criterion for transformers, namely hydrogen (H2), methane (CH4), ethane (C2H6), ethylene (C2H4) and acetylene (C2H2). Based on the concentrations of these gases and the ratios between them, three famous DGA methods have been established, namely the Rogers Ratios method [13,45], the Duval Triangular method (DTM) [14] and the IEC Ratios method [12]. These three interpretation methods are used to identify the status of the transformer on our test data. As can be seen from Table 4, these methods perform poorly on health samples: the recall is as low as 40%, and the G-mean is around 60%. Since every region in the DTM is designated for a faulty status and no region is assigned to the healthy status, the DTM cannot identify whether a transformer is healthy, resulting in a G-mean of NaN (the number of predicted health samples is zero). Due to this insufficiency, interpretation methods are not suitable for power transformer diagnosis when the healthy condition is considered.
It should be noted that the primary task of online transformer diagnosis is to identify whether the transformer is healthy or faulty, which results in a binary classification. After a fault is found, interpretation methods or AI algorithms follow to further identify the type of fault. This work focuses on the primary task, so only the binary classification is considered.

| Tests on AI imbalanced algorithms
To test various AI algorithms, state-of-the-art imbalanced algorithms, as well as SPE, are evaluated for comparison. We compare seven popular oversampling algorithms, namely SMOTE [46], Borderline-SMOTE [47], Safe-level SMOTE [48], ADASYN [32], MWMOTE [33], CGMOS [49] and MAHAKIL [50]. A classifier is applied after the resampling. To compare various types of classifiers, support vector machines (SVM) [51], K-nearest neighbours (KNN) and decision trees (DT) are used. For comparison, a cost-sensitive (CS) strategy integrated with the SVM, KNN and DT classifiers, with a cost matrix equal to the inverse of the imbalance ratio, is also evaluated. These classifiers are implemented in Matlab 2019b and have been well optimized by MathWorks to deal with imbalance. Meanwhile, we also evaluate seven ensemble algorithms: Bagging, SMOTEBagging [52], AdaBoost [37], SMOTEBoost [39], RUSBoost [40], EasyEnsemble [53] and BalanceCascade [53]. The base classifier of these ensemble algorithms is DT, and the number of classifiers in each ensemble is 100.
To investigate the influence of health samples on the diagnosis performance, the number of health samples (N_N) used for training is increased gradually from 600 to 9000, where 600 corresponds to the balanced situation. The Recall and G-mean of these algorithms are listed in Table 5 for individual classifiers, Table 6 for resampling algorithms, and Table 7 for ensemble algorithms. The value in bold is the best in each row. The comparison results are summarized as follows.

TABLE 3 Confusion matrix for imbalanced problems
                      Positive prediction    Negative prediction
Actual positive (P)   True positive (TP)     False negative (FN)
Actual negative (N)   False positive (FP)    True negative (TN)

| Balance is not a panacea
By comparing the balanced case (N_N = 600) with data sets containing different numbers of health samples (N_N), it can be seen that a balanced data set cannot guarantee an accurate classifier. The best classification for the resampling and ensemble algorithms appears at N_N = 9000, where the imbalance ratio is the largest. Therefore, a balanced data set is not necessary for transformer diagnosis. As indicated in Refs. [42,43], imbalance is not the main source of classification difficulty; it only amplifies the difficulties caused by the original data.

| More health data, higher accuracy?
Previous studies on DGA mainly focus on fault classification. The health samples used in previous research are limited, and the influence of their presence is seldom investigated. As the results in Tables 5-7 show, the performance of the classification algorithms increases with the number of health samples and reaches its best at N_N = 9000. Although no additional fault data are added to the data set, the diagnosis accuracy is still improved, because more data means wider coverage of the information. Therefore, the diagnosis efficiency of the transformer can be improved by adding exhaustive health samples that contribute more information to the problem.

| Inefficient resampling algorithms
Oversampling can be regarded as a preprocessing technique, with classifiers following for prediction. Comparing Table 5 (CS + classifiers) and Table 6 (oversampling + classifiers) reveals that the cost-sensitive strategy is more effective than the oversampling algorithms on the DGA problem. This is because oversampling generates new samples based on the distances among existing samples; however, distance is hard to define on the DGA data set, especially when data fusion is adopted. Due to the serious overlapping and the small quantity, the minority class lacks a good representation or a clear distribution structure. In this case, oversampling methods (e.g. SMOTE and its variants) that directly rely on analysing the neighbour relationships among a few samples usually fail, or are even counterproductive: the newly generated samples may act as noise in the original data set. The performance of these oversampling algorithms can be improved by integrating the data information. MWMOTE and CGMOS show the best classification performance among the oversampling algorithms because they identify the informative minority-class samples before generating new samples, which effectively avoids the effect of noise during the data generation process. MAHAKIL obtains the worst result because it generates new samples based on evolutionary theory: new samples are generated simply by random combination among samples and their bins. This provides more diversity than the other algorithms but also carries a higher risk of generating noise.

| Ensemble algorithms are better?
As the results in Table 7 show, Bagging and AdaBoost are superior to the oversampling algorithms and the other ensemble algorithms, except SPE. Bagging trains a model on resampled subsets and takes the average; it cannot significantly reduce the bias, but it can significantly reduce the variance. In contrast, boosting minimizes the loss function sequentially, so its bias gradually decreases. Both show good performance on the data set in this work.
SMOTEBagging and SMOTEBoost are hybrid algorithms that integrate oversampling and ensembles. They perform worse than Bagging and AdaBoost, which indicates that SMOTE has a negative effect on the ensemble classifiers, consistent with the previous discussion.
RUSBoost shows the worst performance of all. RUSBoost is an ensemble classifier integrating random undersampling and boosting. Undersampling loses a great deal of information and has a significant negative effect on the classification; consequently, undersampling should be avoided in transformer diagnosis. EasyEnsemble is another undersampling-based ensemble; it also performs badly because it underfits the minority class.
BalanceCascade iteratively discards majority samples that are well classified by the current classifier. This may result in overfitting hard samples in late iterations, which finally deteriorates the ensemble.

| SPE performs well
Compared with the other algorithms, SPE shows the highest Recall and G-mean in most cases, demonstrating that SPE provides the best performance for different N_N. SPE takes the characteristics of the data into account during training and fills the gap between the data sampling strategy and the classifiers' capacity. The issues of the other algorithms discussed above are all handled through the classification hardness. SPE reaches its best performance at N_N = 9000, where the data set is largest and the imbalance ratio is highest, which demonstrates the capacity of SPE to deal with imbalance and data fusion.

| Efficiency comparison
To demonstrate the efficiency, the computation times of all algorithms are evaluated with 9000 health samples. All algorithms are run on a workstation with an Intel E5-2699 CPU at 2.30 GHz and 32 GB of memory. The average computation times over five runs are listed in Table 8. SPE ranks 12th out of the 18 algorithms, and 5th out of the 8 ensemble methods. SPE provides the best classification performance without losing much efficiency and is therefore the best choice in this work. The results also reveal that oversampling unavoidably increases the computation burden compared with individual classifiers: MWMOTE performs best among the oversamplers but consumes too much computation time. Undersampling algorithms such as RUSBoost and EasyEnsemble show fast training but low accuracy.

| CONCLUSION
This article presented a comprehensive investigation of DGA with an imbalanced data set from the view of both the data set and the methodology. To overcome the unsatisfactory diagnosis performance, a large number of health samples are collected to improve the classification. Since imbalance and data fusion are thereby introduced, SPE is employed, in which classification hardness is used to account for the data characteristics during classification. The performance of traditional interpretation methods and AI based methods, including different classifiers and their interaction with the data, is thoroughly investigated. The experiments reveal the following:
1. The diagnosis efficiency of the transformer can be improved by adding exhaustive health samples together with suitable imbalanced algorithms. The quality of the data is the fundamental determinant of the diagnosis performance; more data means wider coverage of the information.
2. The cost-sensitive strategy is more effective than the oversampling algorithms on the data fusion problem, because oversampling may generate wrong samples. These wrong samples may mislead the classifiers and have a negative effect on the diagnosis.
3. Though widely used to handle imbalance problems, undersampling algorithms have a negative effect on transformer diagnosis. Therefore, undersampling-based algorithms should be avoided when dealing with transformer diagnosis.
4. Ensemble algorithms (Bagging, Boosting, etc.) show good performance on transformer diagnosis, especially SPE. SPE performs best among all algorithms because it takes the distribution of the data into account in the classification; this feature is important when data fusion is involved. Therefore, SPE is recommended, and more data should be involved in transformer diagnosis.