Improved information entropy weighted vague support vector machine method for transformer fault diagnosis

National Natural Science Foundation of China, Grant/Award Numbers: 51777117, U1966209

Abstract

A combined model based on improved information entropy and vague support vector machine (IVSVM) is introduced into transformer fault diagnosis using dissolved gas analysis (DGA) of oil. The improved information entropy method is used to obtain the weight of each gas and to weight the raw data, and the processed training data and the corresponding fault types are fed into the vague support vector machine (VSVM) model to obtain classifiers. Firstly, the training data are weighted by the improved information entropy method to separate the original data from the mixed state for subsequent classifier training. Then, the vague set divides the events into true, false and unknown factors, which can optimise the decision boundary of the SVM and improve the classification accuracy for boundary points. Finally, fault data from the literature and from actual collections are selected for training and testing. A comparison with the widely used ratio methods and artificial intelligence methods shows that the method described herein can effectively improve the accuracy of fault diagnosis. The results also show that this method has better applicability when classifying actual fault types with higher data similarity.


| INTRODUCTION
Power transformers are critical power equipment, and their stable operation is essential to the power supply. During operation, transformer faults can occur because of environmental factors, human factors and other causes. A small amount of gas is dissolved in the transformer insulating oil, and the composition of this gas changes markedly when a fault occurs. The concentrations and relative proportions of the by-product gases are closely related to the transformer insulation condition [1].
Dissolved gas analysis (DGA) of oil can cover most transformer failure scenarios. The gases used for this analysis are hydrogen, methane, ethane, ethylene, acetylene, carbon monoxide, and carbon dioxide [2][3][4][5]. CO and CO2 are produced by the decomposition of cellulose in solid insulation; their concentrations usually vary significantly under high-temperature overheating, but not under other fault conditions. Therefore, conventional ratio methods largely ignore CO and CO2 [3]. The content and proportion of the dissolved gases vary depending on the fault conditions. For example, partial discharges typically manifest as a surge in H2 content. Low-energy discharges and high-energy discharges are accompanied by C2H2 production. The percentages of C2H4 and CH4 differ for different degrees of overheating faults. Some scholars have proposed improved ratio methods to address the problems with the ratio method. For example, ratio methods involving more parameters have been proposed [6,7], which divide the ratio conditions more finely and can effectively improve accuracy. Others have proposed new graphical methods, such as the pentagon method [8,9]. In addition, machine learning methods have been applied to fault diagnosis and have achieved good results, including neural networks [10,11], fuzzy set theory [12], Bayesian networks [13], set pair theory [14], support vector machines [15,16], the Parzen window [17] and other methods [18].
Herein, an improved vague support vector machine method (IVSVM) based on the information entropy method is proposed. The information entropy method processes the different classes of data to make them easier to distinguish, and the VSVM method then classifies transformer fault types, taking advantage of its higher classification accuracy for data at the boundary. Five gases, hydrogen, methane, ethane, ethylene and acetylene, are chosen as the characteristic quantities for the study.
Open source fault data containing six fault types and actual collected operational data containing nine fault types are used for the training and testing data.
SVM, as a shallow classifier, has the advantages of high classification accuracy and fast classification speed. In addition, the number of selected features is moderate, all of them are gases, and they are highly correlated with each other (all of them are decomposition products in transformer oil). Deep learning methods, such as deep neural networks, can lead to large time and space complexities, which can reduce the effectiveness of classification learning. The analysis shows that the method described herein performs better than traditional ratio methods and machine learning methods in fault classification. It also classifies better when the data quality is poor, and can effectively classify faults with similar data when the fault types are subdivided.
Section 2 introduces the improved algorithm and its flowchart; Section 3 analyses the results on open source data collected from the literature and on actual transformer operation data collected in southwest China to demonstrate the performance of the algorithm; and Section 4 gives the conclusion.

| Vague set
Vague set theory [19] is a complement to fuzzy sets that describes the object of study in terms of both truth and falsity. Vague sets can better describe uncertainty and compensate for the limitation of the single membership function used in fuzzy sets.
Definition: Let U be a space of points, with a generic element of U denoted by x.
A vague set V in U is characterised by a truth membership function $t_v$ and a false membership function $f_v$. $t_v(x)$ is a lower bound on the grade of membership of x derived from the evidence for x, and $f_v(x)$ is a lower bound on the negation of x derived from the evidence against x. $t_v(x)$ and $f_v(x)$ both associate a real number in the interval [0,1] with each point in U, and

$$t_v(x) + f_v(x) \le 1$$

This constrains the grade of membership of $x_i$ to the subinterval $[t_v(x_i),\ 1 - f_v(x_i)]$ of [0,1]. The vague value consists of three parts: the truth membership function $t_v(x)$, the false membership function $f_v(x)$, and the unknown part $m_v(x) = 1 - t_v(x) - f_v(x)$. The method of obtaining t, m and f for x is introduced in Subsection 2.2.
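The three parts of a vague value can be illustrated with a few lines of Python. This is a minimal sketch, assuming the unknown part is the remainder 1 - t - f; the function names `vague_value` and `membership_interval` are hypothetical and chosen only for illustration.

```python
def vague_value(t, f):
    """Return (t, f, m) for a vague value, with m = 1 - t - f as the unknown part."""
    if not (0.0 <= t <= 1.0 and 0.0 <= f <= 1.0):
        raise ValueError("t and f must lie in [0, 1]")
    if t + f > 1.0 + 1e-12:
        raise ValueError("t + f must not exceed 1")
    m = max(0.0, 1.0 - t - f)  # clamp tiny negative rounding errors
    return t, f, m


def membership_interval(t, f):
    """Interval [t, 1 - f] that bounds the grade of membership."""
    return t, 1.0 - f


t, f, m = vague_value(0.6, 0.3)
lower, upper = membership_interval(0.6, 0.3)
print(round(m, 3), round(lower, 3), round(upper, 3))
```

A point with strong evidence for and little evidence against a fault type has a narrow interval; a point with little evidence either way has a large unknown part m.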

| Vague support vector machine
In this subsection, the VSVM model is first introduced and then the variable calculation method is introduced in detail.
The support vector machine [20,21] is a binary classification model that aims to find a hyperplane that partitions the samples. The partitioning principle is margin maximisation, which ultimately translates into a convex quadratic programming problem. When the training samples are linearly separable, a linear support vector machine is learnt by maximising the hard margin; when the training samples are not linearly separable, a nonlinear support vector machine is learnt by using the kernel trick and maximising the soft margin. The model is as follows:

$$\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \qquad (1)$$

$$\text{s.t.}\quad y_i(w \cdot x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,n \qquad (2)$$

where w is the hyperplane normal vector, C is the penalty factor, n is the number of samples, $\xi_i$ is the slack variable that represents the allowable error in the linearly non-separable case, $y_i$ is the sample output, $x_i$ is the sample input, and b is the threshold. Equation (2) is the constraint, and the Lagrange multiplier method is introduced to solve the above problem and obtain the optimisation objective function. The radial basis function is chosen as the kernel function of the support vector machine:

$$K(x_i, x_j) = \exp\left(-g\,\|x_i - x_j\|^2\right) \qquad (3)$$

where g is the radial basis kernel function parameter, the value of which has a large impact on the prediction accuracy of the model. The larger g is, the more localised the influence of each support vector becomes. The SVM method relies heavily on the penalty factor and slack variables for boundary points, and is prone to misclassify the points between $y_i(w \cdot x_i + b) = 1$ and the separating hyperplane (sub-interface). Herein, the VSVM method, which combines the vague set and SVM, is proposed to address this problem. The vague set can be used to measure the truth and false membership of things in order to improve the accuracy of the SVM classification of boundary points. The schematic diagram of the VSVM method is shown in Figure 1.
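The role of the kernel parameter g can be checked numerically. The sketch below assumes the common parameterisation K(x, z) = exp(-g * ||x - z||^2); the function name `rbf_kernel` and the two sample points are hypothetical.

```python
import math


def rbf_kernel(x, z, g):
    """Radial basis kernel under the assumed form exp(-g * squared distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-g * sq_dist)


# Two hypothetical normalised gas vectors (two features only, for brevity)
x1, x2 = (0.2, 0.5), (0.4, 0.1)

for g in (0.1, 1.0, 10.0):
    # The kernel value between distinct points shrinks as g grows
    print(g, round(rbf_kernel(x1, x2, g), 4))
```

Under this assumed form, a large g makes the kernel value between distinct points decay quickly, so each training point affects only its immediate neighbourhood, which is why g needs tuning together with C.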
As shown in Figure 1, two types of faults, marked with crosses and dots, are defined as $F_1$ and $F_2$, and the remaining faults are defined as Other Faults. The point in $F_1$ marked in green represents one case. t is the probability that the point belongs to fault $F_1$, f is the probability that the point belongs to $F_2$, and m is the probability that the point belongs to Other Faults. In the VSVM model, w is the hyperplane normal vector, $\xi_i$ is the slack variable, and t, f, m are the truth membership, false membership and unknown part of the vague set. In Figure 1, three sub-interfaces (separating hyperplanes) are used to build VSVM classifiers: between $F_1$ and $F_2$, between $F_2$ and Other Faults, and between $F_1$ and Other Faults. The values of $Q^+$ and $Q^-$ for these three sub-interfaces are therefore t and f, f and m, and t and m, respectively. $Q^+$ and $Q^-$ are the correction factors of ξ. The smaller Q is, the smaller the role played by ξ and hence the less important the corresponding x is. Points away from the support vectors then have less influence on the boundary, which makes it easier to obtain a more reasonable boundary. Lagrange multipliers α and β are introduced to transform the problem, and the partial derivatives of the Lagrangian L with respect to w, b and ξ are set to zero according to the Karush-Kuhn-Tucker conditions.
Substituting the results into L yields the dual problem, whose parameters can be solved using the SMO method [22]. The final solution yields the support vectors and w.
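The derivation described above can be written out under one plausible reading of the model. The sketch below assumes that the correction factor multiplies each slack variable in the objective, with $Q_i$ standing for $Q^+$ when $y_i = +1$ and $Q^-$ when $y_i = -1$. This primal form and the symbol $Q_i$ are assumptions for illustration and are not taken from the original equations; the remaining steps follow mechanically from it.

```latex
\begin{align*}
% Assumed primal: the correction factor Q_i scales the slack penalty
&\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} Q_i\,\xi_i
 \quad \text{s.t.}\quad y_i(w\cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \\[4pt]
% Lagrangian with multipliers alpha_i >= 0 and beta_i >= 0
&L = \frac{1}{2}\|w\|^2 + C\sum_{i} Q_i\,\xi_i
   - \sum_{i}\alpha_i\bigl[y_i(w\cdot x_i + b) - 1 + \xi_i\bigr]
   - \sum_{i}\beta_i\,\xi_i \\[4pt]
% Stationarity conditions
&\frac{\partial L}{\partial w} = 0 \ \Rightarrow\ w = \sum_{i}\alpha_i y_i x_i, \qquad
 \frac{\partial L}{\partial b} = 0 \ \Rightarrow\ \sum_{i}\alpha_i y_i = 0, \qquad
 \frac{\partial L}{\partial \xi_i} = 0 \ \Rightarrow\ C\,Q_i - \alpha_i - \beta_i = 0 \\[4pt]
% Dual problem after substitution, with the inner product replaced by the kernel
&\max_{\alpha}\ \sum_{i}\alpha_i
   - \frac{1}{2}\sum_{i}\sum_{j}\alpha_i\alpha_j y_i y_j K(x_i, x_j)
 \quad \text{s.t.}\quad \sum_{i}\alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C\,Q_i \\[4pt]
% Resulting decision function
&f(x) = \operatorname{sign}\Bigl(\sum_{i}\alpha_i y_i K(x_i, x) + b\Bigr)
\end{align*}
```

Under this reading, the only change from the standard soft-margin dual is the per-sample upper bound $C\,Q_i$ on $\alpha_i$: a sample with a small Q can contribute only weakly to the boundary, which matches the stated role of Q as a correction factor of ξ.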
The classification function is then obtained from the resulting support vectors. The parameters t, f, m of the VSVM in Equation (6) are obtained in combination with the KNN algorithm. KNN is the k-nearest neighbour method, a common instance-based classification method. The acquisition steps are as follows:
(1) Define $F_i$ as fault type i and $X_{ij}$ as the j-th sample of fault type i. Select two fault types from all the fault types to build a support vector machine model, for example $F_1$ and $F_2$. The data corresponding to fault $F_1$ and fault $F_2$ are $D_1 = \{X_{11}, X_{12}, \dots\}$ and $D_2 = \{X_{21}, X_{22}, \dots\}$.
(2) Select a point in $D_1$, such as $X_{11}$, and use the KNN method to obtain its k nearest neighbours. The value of k is usually less than the square root of the data volume n, and three classes are involved when building one VSVM classifier (k is at least 1), so k is an integer with $k \in [1, \sqrt{n}]$. Subsection 2.4 introduces the selection of the k value. An example is shown in Figure 2. The point marked in green is $X_{11}$ and its k nearest neighbours are circled with a dotted line. The numbers of neighbours belonging to fault $F_1$, fault $F_2$ and other faults are $n_1$, $n_2$ and $n_3$, respectively. The parameters $t_{12}$, $f_{12}$, $m_{12}$ of point $X_{11}$ are as follows:

$$t_{12} = \frac{n_1}{k},\qquad f_{12} = \frac{n_2}{k},\qquad m_{12} = \frac{n_3}{k}$$

Therefore $t_{12} = 0.667$, $f_{12} = 0.167$, $m_{12} = 0.167$. Since k is a variable, the value of m changes when k changes: m varies from 0 to its maximum and t varies from 1 to its minimum. Therefore, this method of obtaining t, f, m satisfies the Definition in Subsection 2.1.
(3) Obtain the t, f, m parameters of each data item by the above method and build $\frac{n(n-1)}{2} \times 3$ VSVM classifiers using Equations (4)-(10), where n is the number of fault types.

FIGURE 2 A sample of obtaining t, f, m using KNN. KNN, k-nearest neighbour
ZHANG ET AL.
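The steps above can be sketched in Python. This is a minimal illustration, assuming Euclidean distance, neighbour shares n1/k, n2/k and n3/k, and a query point that is not itself counted among its neighbours; the function names, the label "other" and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations


def vague_parameters(query, samples, labels, f1, f2, k):
    """Estimate (t, f, m) for `query` from the labels of its k nearest neighbours.

    t is the share of neighbours labelled f1, f the share labelled f2,
    and m the share that belongs to any other fault type.
    """
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    ranked = sorted(range(len(samples)), key=lambda i: math.dist(query, samples[i]))
    neighbours = [labels[i] for i in ranked[:k]]
    n1 = neighbours.count(f1)
    n2 = neighbours.count(f2)
    n3 = k - n1 - n2
    return n1 / k, n2 / k, n3 / k


def boundaries(fault_types):
    """List the three boundaries associated with every pair of fault types."""
    result = []
    for a, b in combinations(fault_types, 2):
        result += [(a, b), (b, "other"), (a, "other")]
    return result


# Toy data: the query is excluded from the sample list (an assumption)
query = (0.0, 0.0)
samples = [(0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1),  # F1
           (0.2, 0.0), (5.0, 5.0), (6.0, 5.0),                # F2
           (0.0, 0.2), (5.0, -5.0)]                           # other faults
labels = ["F1"] * 4 + ["F2"] * 3 + ["F3"] * 2

t, f, m = vague_parameters(query, samples, labels, "F1", "F2", k=6)
print(round(t, 3), round(f, 3), round(m, 3))  # 0.667 0.167 0.167

n_classifiers = len(boundaries(["PD", "LED", "HED", "T1", "T2", "T3"]))
print(n_classifiers)  # 6 * 5 / 2 * 3 = 45
```

With six neighbours, of which four belong to F1, one to F2 and one to another fault, the sketch reproduces the values t = 0.667, f = 0.167 and m = 0.167 quoted in the example.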

| Obtaining data weights by improved information entropy method
Before the data can be used, they need to be processed for more accurate results. The gas concentrations are first normalised according to Equation (14).
The data shown later indicate that the volume share of H2 among the five gases may be high. If hydrogen were included when normalising the four gases other than hydrogen, the volume shares of these gases could become too small and information would be lost. Equation (14) therefore normalises the four gases other than hydrogen separately, which highlights the differences between cases when classifying actual faults with the same fault type.
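The effect of keeping hydrogen out of the denominator can be shown with a short example. This is only one reading of the normalisation described above: the four hydrocarbons are divided by their own sum, and the choice to divide H2 by the five-gas total is an assumption made here for illustration. The function name and the concentrations are hypothetical.

```python
def normalise_gases(h2, ch4, c2h6, c2h4, c2h2):
    """Normalise gas concentrations (for example in ppm) to volume shares.

    Assumption: the four hydrocarbons are normalised by their own sum so that
    a dominant H2 reading does not flatten them; H2 is normalised by the
    five-gas total.
    """
    hydrocarbons = {"CH4": ch4, "C2H6": c2h6, "C2H4": c2h4, "C2H2": c2h2}
    hc_total = sum(hydrocarbons.values())
    all_total = h2 + hc_total
    if hc_total <= 0 or all_total <= 0:
        raise ValueError("gas concentrations must have a positive sum")
    shares = {name: value / hc_total for name, value in hydrocarbons.items()}
    shares["H2"] = h2 / all_total
    return shares


# Hypothetical case dominated by hydrogen
shares = normalise_gases(h2=900.0, ch4=30.0, c2h6=10.0, c2h4=40.0, c2h2=20.0)
print(shares)
```

In this hypothetical case CH4 has a share of 0.3 among the hydrocarbons, whereas dividing by the five-gas total would give 0.03 and compress the differences between the four gases.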
The training data for fault type F are $X^{(F)}$, a matrix of p rows and q columns, where p is the number of cases and q is the number of features, that is, the number of gas types used here (five). $X^{(F)}$ is shown below.

$$X^{(F)} = \begin{bmatrix} x_{11}^{(F)} & \cdots & x_{1q}^{(F)} \\ \vdots & \ddots & \vdots \\ x_{p1}^{(F)} & \cdots & x_{pq}^{(F)} \end{bmatrix} \qquad (15)$$

Entropy is a physical concept in thermodynamics and is a measure of how chaotic a system is. The higher the entropy, the more chaotic the system is, and the less information it carries. The lower the entropy, the more ordered the system is and the more information it carries. Information entropy [23] draws on the concept of entropy in thermodynamics to describe the average amount of information in a source. Defining $e_j^{(F)}$ as the information entropy of the j-th gas when the fault is F, $e_j^{(F)}$ is calculated as in Equation (16).
The classification herein is based on the different percentages of the various gas contents under different fault conditions. For example, the proportion of acetylene content in low-energy and high-energy discharges far exceeds that in thermal and partial discharge faults; the proportion of hydrogen content in partial discharge faults far exceeds that in other types of faults.
Define $\omega_j^{(F)}$ as the weight of the j-th gas of fault type F, as given in Equations (17)-(19). Equation (17) is the coefficient of variation in the entropy method. The larger g is, the more scattered the data distribution. Herein, the weight of a gas is considered within a single fault type. Therefore, the more concentrated the distribution of the volume fraction of a gas, the more representative it is of that fault type, that is, the greater the weight of the gas. In summary, the role of g here is opposite to that in the entropy value method, and therefore its inverse is taken, as shown in Equation (18). The weighted vector $P^{(F)}$ of fault F is defined in Equation (20).
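The weighting idea can be sketched as follows. The formulas used here are assumptions based on the standard entropy weight method, not the original Equations (16)-(20): the entropy of each gas column is computed from its normalised shares, g is taken as 1 minus the entropy, and the weights are the normalised inverses of g. The function names, the small constant `eps` and the sample data are hypothetical.

```python
import math


def entropy_weights(data, eps=1e-12):
    """Weight each column (gas) of one fault type by the inverse of g = 1 - entropy.

    data: list of p rows (cases), each with q non-negative gas shares.
    Columns whose values are concentrated (entropy close to 1) get larger weights.
    """
    p, q = len(data), len(data[0])
    if p < 2:
        raise ValueError("at least two cases are needed")
    inverse_g = []
    for j in range(q):
        column = [row[j] for row in data]
        total = sum(column)
        if total <= 0:
            raise ValueError("each gas column needs a positive sum")
        probs = [value / total for value in column]
        entropy = -sum(pr * math.log(pr) for pr in probs if pr > 0) / math.log(p)
        g = max(0.0, 1.0 - entropy)        # coefficient of variation
        inverse_g.append(1.0 / (g + eps))  # eps avoids division by zero
    norm = sum(inverse_g)
    return [value / norm for value in inverse_g]


def apply_weights(data, weights):
    """Scale every case by the per-gas weights."""
    return [[w * v for w, v in zip(weights, row)] for row in data]


# Hypothetical shares of two gases over four cases of the same fault type:
# the first gas is nearly constant, the second is widely scattered.
cases = [[0.30, 0.05], [0.31, 0.40], [0.29, 0.10], [0.30, 0.45]]
weights = entropy_weights(cases)
weighted_cases = apply_weights(cases, weights)
print([round(w, 4) for w in weights])
```

In this toy data the nearly constant gas receives almost all of the weight, which is the behaviour the text describes: a gas whose volume fraction is concentrated within a fault type is treated as more representative of that type.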

| Algorithm flowchart
According to the description above, the flow chart of the IVSVM method is shown in Figure 3. In conjunction with Figure 3, the steps of the IVSVM approach described here are as follows: (1) Normalise the gas content of the raw data. The normalisation method is given in Equation (14).

| Open source data analysis in international standards and literature
A total of 342 cases from the literature [24][25][26][27][28][29][30] were used here for testing. PD is partial discharge, LED is low energy discharge, HED is high energy discharge, T1 is low temperature overheat, T2 is medium temperature overheat, and T3 is high temperature overheat. The types and numbers of fault data are shown in Table 1.
The IEC three-ratio method, Rogers method, Duval triangle method, SVM method and BPNN (BP neural network) method were used for comparison with the algorithm described here. The number of cases correctly predicted by each method for each fault type is shown in Figure 4 and the accuracy of each algorithm is shown in Figure 5. In Figure 4, the horizontal axis refers to the type of fault and the vertical axis refers to the number of cases. The dark blue bar labelled "testing data" refers to all the data used for testing (Table 1, testing data). The other bars represent the number of cases diagnosed correctly by each method for each fault type. From Figure 4, the total number of testing data is 117. From Figure 5, the accuracy of the IEC three-ratio method, Rogers method, Duval triangle method, SVM method, BPNN and IVSVM are 70.9%, 77.8%, 84.6%, 85.5% and 90.6%. When the data quality is low, that is, when there are more boundary points, the accuracy of the IEC three-ratio method and the Rogers method is lower. The Duval triangle method shows similar accuracy to the SVM method, because its classification boundaries are more detailed and more realistic. SVM, BPNN and IVSVM have higher accuracy rates than the traditional ratio methods. The main reason for this is that their classification of boundary points is more realistic and less prone to misclassification.
Part of the data, covering different fault types, was selected and analysed with the different algorithms; the diagnostic results are shown in Table 2. Data with an asterisk (*) are fault data of poor quality. The results for these data vary between methods and do not match the actual faults. Data with two asterisks (**) are of average quality. These data usually lie on the boundary between different types of faults, and different methods may give different classification results that are nevertheless close to the actual type of fault. The rest of the data in Table 2 are fault data of good quality; the results of analysing these data with the different methods are basically the same and consistent with the actual fault.
It can be seen from Table 2 that the IEC three-ratio method and the Rogers method have lower accuracy when the data quality is poor, and the Rogers method also cannot judge some data because of its boundary conditions. The Duval triangle method has higher accuracy owing to its more detailed classification and more complex classification conditions. The SVM, BPNN and IVSVM methods have higher accuracy than the traditional ratio methods and give better results on data that are prone to misclassification. IVSVM works better when the data quality is poor.

| Analysis of actual operational data
To assess the practical applicability of the method, the algorithm was trained and tested on 630 fault cases collected in southwest China. The data were collected from the data management system of China Southern Power Grid. These are real faults from actual operation, and they were verified by experts and the inspection companies. For convenience, the fault codes A-I are used to indicate the fault conditions. The types and numbers of these faults are shown in Table 3.
Nine types of faults are included within the 630 cases, with 70 cases for each fault type. Fifty cases of each fault type were used as training data and the remainder were used as testing data. A portion of the data is selected and optimised by weighting using the improved information entropy method. The data corresponding to the various types of faults are shown in Table A1. The gas weighting factor for each fault is shown in Table 4.
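The sizes of the training and testing sets follow directly from these figures. The short calculation below only restates the numbers given in the text; the variable names are hypothetical.

```python
n_fault_types = 9      # fault codes A to I
cases_per_type = 70
train_per_type = 50

test_per_type = cases_per_type - train_per_type
total_cases = n_fault_types * cases_per_type
n_train = n_fault_types * train_per_type
n_test = n_fault_types * test_per_type

print(total_cases, n_train, n_test)  # 630 450 180
```

Each fault type therefore contributes 20 testing cases, giving 450 training cases and 180 testing cases in total.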
The effect of information entropy weighting is shown for this selected part of the data. Since the data have five characteristic quantities, three of them, CH4, C2H6 and C2H4, were selected for display, as shown in Figures 6(a)-(d). It can be seen from Figure 6(a) that, before processing, raw data of the same class are scattered and different classes of data are mixed. Classes A, C and F are mixed, class I has a point at the top that is very far from the other points, and class E also has such an outlying point. Therefore, classifying or clustering these data directly will lead to misclassification. Figure 6(b), taken from a different viewing angle, confirms that these problems are real and not an artefact of the angle of observation. Figure 6(c) shows the data after the weighting process. The data become concentrated and the boundaries between the different classes of data are clear. The mixing of classes A, C and F seen in Figure 6(a) is effectively reduced, and the problem with classes I and E is also solved. Classes G and I appear to be mixed at this angle, but a different angle reveals that this is not the case, as shown in Figure 6(d).
After observing the weighted optimisation results for CH4, C2H6 and C2H4, C2H4 was selected as the reference to analyse the weighted optimisation effect on the two remaining gases (C2H2 and H2). It can be observed from Figure 6(e) that classes C and F are mixed, as are classes G and I. After optimisation, the data shown in Figure 6(f) are more compact within classes and the spacing between classes is clear. The improved information entropy method used here has an obvious effect on separating the data, which clearly helps the subsequent classification work.

FIGURE 4 The number of cases correctly predicted by each method for each fault type. HED, high energy discharge; LED, low energy discharge; PD, partial discharge
FIGURE 5 Accuracy of each method

TABLE 3 Types and numbers of actual operational data
The Duval triangle method is ineffective when classifying multiple fault types, so SVM, BPNN and IVSVM were chosen for comparison. The learning rate of BPNN is 0.01 and the number of iterations is 500. The parameters of SVM, BPNN and IVSVM were optimised using the particle swarm algorithm. The maximum number of generations is 300, the population size is 20, and the update rates of the parameters C and g and of BPNN's w and b are all set to 1.1. The best (C, g) is (37.68, 105.78) for SVM and (14.69, 73.59) for IVSVM. Figure 7 shows the changes in the best accuracy and the average fitness (adaptation rate) of the three methods. Table 6 shows the accuracy on the testing data and training data for the three methods, SVM, BPNN and IVSVM.
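The parameter search can be illustrated with a compact particle swarm sketch. This is a generic textbook variant, not the configuration used in the study: the inertia and acceleration coefficients, the search bounds, the iteration count and the `toy_error` surface (which stands in for the classification error as a function of C and g) are all hypothetical.

```python
import random


def pso_minimise(objective, bounds, n_particles=20, n_iter=50,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise `objective` over box `bounds` with a basic particle swarm."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # personal best positions
    best_val = [objective(p) for p in pos]      # personal best values
    g_idx = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g_idx][:], best_val[g_idx]  # global best
    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Move the particle and clamp it to the search box
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


def toy_error(params):
    """Hypothetical error surface with its minimum at C = 20, g = 80."""
    c, g = params
    return ((c - 20.0) / 100.0) ** 2 + ((g - 80.0) / 100.0) ** 2


best_params, best_error = pso_minimise(toy_error, [(0.1, 100.0), (0.1, 200.0)])
print([round(v, 2) for v in best_params], best_error)
```

In practice the objective would be the cross-validated error of the classifier trained with the candidate (C, g), which is far more expensive to evaluate than this toy surface; that cost is one reason to keep the population size and the number of generations modest.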
Combining Figure 7 and Table 6, the training accuracy of SVM is 87.3% and the test accuracy is 84.4%; the training accuracy of BPNN is 89.1% and the test accuracy is 86.7%; and the training accuracy of IVSVM is 94.2% and the test accuracy is 91.7%. IVSVM improves on SVM and BPNN because the pre-processing raises the quality of the data and VSVM classifies the boundary points better. The diagnostic results show that the accuracy of the SVM and BPNN methods is similar. Depending on the fault type, sometimes SVM is better than BPNN and sometimes BPNN is better than SVM, which indicates that both methods have their advantages. IVSVM outperforms the other two methods in classifying all types of faults. For classes that are prone to misclassification (e.g., fault type F), its training and testing accuracies are both higher than those of the other two methods, indicating that IVSVM is more effective when data are mixed in multiple fault classification.

| CONCLUSION
Transformer fault monitoring is an important part of power system operation. An improved vague support vector machine method is proposed herein. The method uses an improved information entropy method to weight the dissolved gases in transformer oil and then builds multiple vague support vector machine classifiers for transformer fault diagnosis. The important findings are as follows:

(1) Five types of dissolved gases in transformer oil are selected as the characteristic quantities for transformer fault diagnosis. The improved information entropy method is used to obtain the weights from the training data and to weight the training data and the test data.
(2) VSVM is used to divide the weighted data into three parts and build multiple one-to-one classifier models. The parameters of VSVM are obtained by the KNN method and optimised by the particle swarm method. VSVM can ensure the accuracy of one-to-one classification while coping with the problem of easy misclassification of interclass points.
(3) A total of 342 cases from collected open source international data sets and 630 actual transformer fault cases from a location in southwest China were used to test the method proposed herein. By comparing the results with various ratio methods as well as the SVM and BPNN methods, it was found that the proposed method has a higher accuracy. When more types of faults occur in actual operation and the characteristics between classes are not obvious, this method can classify the faults better. It has contributed to transformer fault monitoring in a region in southwest China and offers practical guidance for actual operation and maintenance work.

FIGURE 7 Changes of the adaptation rate and best accuracy of the three methods using a particle swarm algorithm. BPNN, BP neural network; IVSVM, improved information entropy and vague support vector machine; SVM, support vector machine