Method of inter-turn fault detection for next-generation smart transformers based on deep learning algorithm

Abstract: In this study, an inter-turn fault diagnosis method based on a deep learning algorithm is proposed. Twelve-channel data, comprising both primary and secondary voltage and current waveforms, are obtained in MATLAB/Simulink as time-domain monitoring signals and labelled with 16 different fault tags. An auto-encoder is presented to classify the fault type of the abundant and comprehensive fault waveforms. The overall waveforms compose a two-dimensional data matrix, and the auto-encoder is trained to extract the features of the multi-channel waveforms. The selected features are convolved with the original data, generating a one-dimensional vector as the input to the softmax classifier. Variables such as the type, activation function and depth of the auto-encoder, the sparsity of the sparse auto-encoder, the number of features and the pooling strategy are studied, which gives an intuitive process for training a proper learning model. The overall recognition accuracy reaches 99.5%. Signal characteristics such as channel selection, the time span of the input signal and the signal sampling frequency are studied to find the best solution for inter-turn fault detection of the three-phase transformer. The proposed method under the deep learning framework increases the accuracy and robustness of transformer fault diagnosis, indicating its potential and prospects in next-generation smart transformers.


Introduction
A power transformer is one of the most essential and expensive pieces of equipment in the power transmission and distribution system. As the power rating per transformer grows higher and higher, the loss of a power transformer not only causes an interruption in power supply but also disrupts the stability of the whole system. Therefore, it is very important to detect incipient inter-turn faults [1]. In general, detection of inter-turn short circuit faults can be classified into two categories: online monitoring and offline testing.
Although an offline test of a transformer has to be performed while it is out of service, its accuracy and reliability are much higher than those of online monitoring. The winding resistance test can evaluate the current winding status and is able to reveal the existence of an inter-turn fault [2,3]. The partial discharge test is also capable of evaluating insulation status; partial discharge is the early stage of insulation breakdown, which means a short circuit fault can be discovered at its onset using this method [4,5]. After an internal short circuit happens, the electromagnetic force causes winding deformation, to which frequency response analysis (FRA) is very sensitive [6]. The accuracy and application of the FRA method have been extensively discussed in [7][8][9].
Online monitoring and diagnosis is the future trend in this field, since faults can be discovered as early as possible and the detection process does not cause a power failure. Differential protection and digital relays are the classical and robust solutions to this problem. Several modifications based on this method are proposed in [10,11] to adapt it to different application scenarios.
As the nature of inter-turn fault detection of the transformer is a classification task, machine learning technology seems to be very suitable for this application. Many researchers have tried to find signals containing fault information and applied them to existing machine learning classifiers. The authors in [12,13] make use of the differential current and feed it into a particle swarm optimisation based probabilistic neural network. Wavelets are used in [14,15] for feature extraction on the neutral current. A novel DGA technique based on Parzen window estimation has been presented in [16].
Despite the advantages of the above-mentioned methods, they all have limitations. All offline methods can only be conducted during maintenance. A differential protection and digital relay diagnostic system has higher reliability and can be realised by online monitoring; however, setting the threshold value between normal and fault status is troublesome. All the machine learning methods are based on shallow models, which can only perform the simple job of classification, and the extracted features are designed by humans. This feature design process is the most time-consuming task and requires many experts' opinions.
In this paper, a novel fault diagnosis system is proposed based on the deep learning (DL) method of the auto-encoder (AE) designed by Hinton and Salakhutdinov [17]. With this powerful DL tool, features can be extracted automatically without much expertise. Voltage and current waveforms of both the primary and secondary sides are put together to form a multi-channel signal matrix for the proposed system, which is able to diagnose 15 types of inter-turn short circuit fault. Section 2 introduces the MATLAB/Simulink simulation system and how the data are pre-processed to transform the original fault signals into the data block to be learnt. Section 3 presents the proposed classification framework and its basic principle. Section 4 shows the parameter tuning process to further improve the performance of the framework. Section 5 studies the influence of input data frequency and channel selection.

Simulation system layout
Several simulation methods for internal winding faults have been proposed, some based on circuit theory and others on electromagnetic theory. Since only current and voltage signals are required in this paper, a circuit model is sufficient and easy to implement. In [18], a transformer model based on primary and secondary coils has been proposed, which is capable of simulating faults between any turn and the earth or between two turns of the transformer windings. The simulation is implemented in MATLAB/Simulink.
A simple three-phase system is illustrated in Fig. 1. Considering the convenience of future laboratory experiments, both transformers are modelled on real transformer structures. The first transformer is oil-immersed and is used to boost the three-phase 380 V to 10 kV. The second transformer is the dry-type one on which faults are deployed. The whole system operates at 50 Hz. The open circuit and short circuit test results of the two transformers are listed in Table 1.

Fault deployment
In this paper, a total of 15 kinds of faults are studied, which can be summarised into three categories, modelled in Fig. 2. In Simulink, primary and secondary winding modules can be used to build a simple transformer, as shown in Fig. 2a. The pink line denotes the magnetic circuit, while the blue line stands for the electrical circuit. All the windings are in one closed magnetic loop, which means mutual inductance is considered in this model, in accordance with the theory introduced in [18].
When a single winding-to-ground fault occurs, short-circuit point 1 splits the winding into two parts, as shown in Fig. 2b. In the same way, when there are two short-circuit points in the fault, such as the double winding fault and the single winding fault shown in Figs. 2c and d, respectively, eight windings are used in the simulation. By tuning the winding turns, different short-circuit positions can be obtained.
In Fig. 2b, only the high-voltage winding-to-ground fault is simulated, because the voltage across the low-voltage winding insulation to ground is not high enough to cause breakdown. Thus, three kinds of faults, A-g, B-g and C-g, are simulated by this model.
Considering the double winding fault, inter-phase low-voltage winding faults are not taken into consideration, because the low-voltage winding is covered by the high-voltage one. For the same reason, inter-phase faults between different sides of the transformer, such as A-b, are not simulated either. A-a, B-b, C-c, A-B, B-C and C-A, a total of six kinds of fault data, are obtained from Fig. 2c. Six more fault circumstances, A-A, B-B, C-C, a-a, b-b and c-c, are simulated in Fig. 2d. By tuning the different parameters of the system, a total of 61,225 records of raw data are generated. Specific parameters such as fault position, load angle, load magnitude and fault occurrence angle are listed in Table 2. The system frequency is 50 Hz.

Pre-processing
Each record of raw data contains a 12-channel waveform lasting 0.5 s, which can be seen as a 12 × l_raw matrix. U_A, U_B, U_C, I_A, I_B, I_C, U_a, U_b, U_c, I_a, I_b and I_c correspond to the 1st to 12th rows of the data record. The 0.5 s time span contains both pre-fault and post-fault information, with the zero degree of the fault occurrence angle aligned to 0.38 s. Only the post-fault data fragment is of interest, so each record is cut into a segment of length t_seg containing only post-fault data, with the beginning time randomly selected from 0.4 to 0.42 s. Since the numbers of A-g, B-g, C-g and no-fault records (1200 and 25, respectively) are far fewer than those of the other fault types, these classes are oversampled so that the final number of data segments per class is 4800. The dataset finally comprises 76,800 records, each of which is a 12 × l_seg matrix. With the segment time t_seg = 0.02 s, which corresponds to one cycle, and a sampling frequency of 20 kS/s, l_raw = 10,001 and l_seg = 401. The pre-processing of an A-A fault data segment is illustrated in Fig. 3. Applying this process to all 16 kinds of fault, the final dataset for fault detection is a 12 × l_seg × 76,800 data cube.
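The segmentation step above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code; the names FS, T_SEG and cut_segment are assumptions, and random data stands in for the Simulink waveforms.

```python
import numpy as np

# Sketch of the segmentation described above (illustrative names, not the
# paper's code): each raw record is a 12 x l_raw matrix covering 0.5 s at
# 20 kS/s; a post-fault window starting at a random instant between 0.40 s
# and 0.42 s is cut out as a 12 x l_seg segment.
FS = 20_000                     # sampling frequency, S/s
T_SEG = 0.02                    # segment length, s (one 50 Hz cycle)
L_SEG = int(T_SEG * FS) + 1     # 401 samples

rng = np.random.default_rng(0)

def cut_segment(record, fs=FS, l_seg=L_SEG):
    """Cut a post-fault segment whose start is drawn from [0.40 s, 0.42 s]."""
    start = int(rng.uniform(0.40, 0.42) * fs)
    return record[:, start:start + l_seg]

raw = rng.standard_normal((12, int(0.5 * FS) + 1))  # one 12 x 10,001 record
seg = cut_segment(raw)
print(seg.shape)  # (12, 401)
```

Repeating this cut with different random start times for the under-represented classes is one simple way to realise the oversampling to 4800 segments per class.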

AE for feature extraction
The AE is a useful tool for feature extraction under the DL framework [19]. It has three components: a visible layer, a hidden layer and a reconstruction layer. Generally, the size of the hidden layer is smaller than that of the input data. The unsupervised training process tries to make the output vector the same as the input one; in this way, the hidden layer successfully extracts a compressed feature representing the original input signal. Therefore, only half of the AE, the input layer and the hidden layer, is used during application, and the features are extracted by the weight matrix of the hidden layer. In [19], a sparse AE (SAE) is used to extract features of various types of ferroresonance overvoltage.
Given a set of input data {x^(1), x^(2), …, x^(N)}, x^(i) ∈ R^n, i = 1, 2, …, N, each input vector is mapped to the hidden layer h ∈ R^m using

h = f(W_1 x + b_1)    (1)

where W_1 ∈ R^(m×n) is the weight matrix connecting the visible and hidden layers and b_1 ∈ R^m is the bias vector. N is the total number of training vectors x^(i); n is the dimension of each input vector; m is the number of hidden nodes. In (1), f(·) is called the activation function. Typically, it is the sigmoid function, defined as

f(z) = 1 / (1 + e^(−z))    (2)

There are also other activation functions, such as tanh, ReLU, ELU and PReLU [20]; the effects of these activation functions are discussed in Section 4. After encoding, the second step, decoding, is completed by the reconstruction layer using

x̂ = f(W_2 h + b_2)    (3)

where W_2 ∈ R^(n×m) and b_2 ∈ R^n are the weight matrix and bias vector associating the output layer with the hidden layer. The accuracy of a simple AE is assessed by the cost function

J_AE = (1/N) Σ_{i=1}^{N} (1/2) ‖x̂^(i) − x^(i)‖²    (4)

In simple terms, training an AE means optimising J_AE. When J_AE is optimised to an expectedly small value or a certain number of iterations is reached, the training process is complete and W_1 is the feature learnt automatically from the dataset. Several improved AEs have been proposed in recent years, such as the SAE and the de-noising AE (DAE) [19]. The difference between the original AE and these improved AEs is an additional penalty term added to (4) to adapt the AE to different applications.
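The encode/decode/cost steps above can be written out as a short numpy sketch. This is a toy illustration under the stated definitions, not the paper's implementation; all dimensions and initial weights here are arbitrary assumptions.

```python
import numpy as np

# Minimal numpy sketch of the AE equations (1)-(4): encode with W1, b1,
# decode with W2, b2, and measure the reconstruction cost J_AE as the
# average squared reconstruction error over N inputs.
rng = np.random.default_rng(1)
n, m, N = 360, 40, 100          # input size, hidden size, number of samples

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = 0.01 * rng.standard_normal((m, n)), np.zeros((m, 1))
W2, b2 = 0.01 * rng.standard_normal((n, m)), np.zeros((n, 1))

X = rng.standard_normal((n, N))          # columns are input vectors x^(i)
H = sigmoid(W1 @ X + b1)                 # (1): hidden activations
X_hat = sigmoid(W2 @ H + b2)             # (3): reconstruction
J_AE = 0.5 * np.mean(np.sum((X_hat - X) ** 2, axis=0))   # (4): cost
```

Minimising J_AE with respect to W1, b1, W2 and b2 (e.g. by L-BFGS, as the paper does) is what "training the AE" means here.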
For the case in this paper, the process of applying the AE to multi-channel signal learning is shown in Fig. 4. In Section 2.3, the dataset was transformed into a 12 × l_seg × 76,800 data cube, which means there are 76,800 records in the dataset and each of them is a 12 × l_seg matrix. The AE is trained on n_patches patches cut from these matrices. A 12 × l_p × n_patches patch cube is randomly cut from the data cube, where l_p is defined as the patch length in the training process. Since each layer of the AE is a one-dimensional vector, each patch matrix is stacked into a vector of length 12 l_p. Thus, the input training set is a 12 l_p by n_patches matrix. Supposing there are m_hidden nodes in the hidden layer, the transformation matrix W_1, the extracted feature, is an m_hidden by 12 l_p matrix. The feature matrix W_1 can be reshaped into m_hidden small 12 by l_p feature matrices. Moreover, these features can be visualised as greyscale images by mapping the matrix values onto the interval from 0 to 255. By setting l_seg, l_p and m_hidden to 401, 30 and 40, a preliminary AE is trained; the optimisation method is limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) [21]. Under most conditions, L-BFGS has the advantages of faster convergence and lower computing-resource cost.
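The patch-cutting step can be sketched as follows. The variable names (records, patches, n_patches) are assumptions for illustration, and random data stands in for the simulated waveforms.

```python
import numpy as np

# Illustrative sketch of the patch extraction described above: from the
# 12 x l_seg records, patches of width l_p are cut at random horizontal
# positions and flattened to 12*l_p vectors, forming the AE training matrix.
rng = np.random.default_rng(2)
l_seg, l_p, n_patches = 401, 30, 1000

records = rng.standard_normal((12, l_seg, 50))   # small stand-in data cube

patches = np.empty((12 * l_p, n_patches))
for k in range(n_patches):
    r = rng.integers(records.shape[2])           # pick a record
    c = rng.integers(l_seg - l_p + 1)            # pick a horizontal offset
    patches[:, k] = records[:, c:c + l_p, r].ravel()  # stack 12 x l_p to 360
print(patches.shape)  # (360, 1000)
```

With l_p = 30, each stacked patch has length 12 × 30 = 360, matching the 360-40 AE structure used later.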
The 40 visualised features are shown in Fig. 5. The darker a pixel is, the greater the value at the corresponding position in the matrix. Most features take moderate values, yet there are some exceptions: for example, features 9, 10, 17, 18 and 36 have more bright and dark pixels, showing a rapid fluctuation of the waveform, and will be activated when there is a sudden change in the input data.

Convolutional and pooling layer for classification
After the AE is trained, the following convolutional and pooling layers transform the original data into a one-dimensional vector used to train the softmax classifier. The whole process is illustrated in Fig. 6. This multi-channel convolutional neural network is used by Zhang et al. [22] and Chen et al. [23] to solve multi-channel signal problems and is called the convolutional AE (CAE) or convolutional sparse AE (CSAE) method. Each original segmented record is a 12 × l_seg matrix and each extracted feature (shown in Fig. 6 as a greyscale picture in a frame of a different colour) is a 12 × l_p matrix. At the beginning, the feature is aligned to the left margin of the segmented record and a dot product is calculated over all overlapped elements, which becomes the first element of the horizontal vector. Then, as the feature moves towards the right margin of the segmented record, the dot products are obtained one by one, completing the horizontal vector. The size of this convolved vector is l_seg − l_p + 1. After repeating this process with the rest of the features, an m_hidden by l_seg − l_p + 1 matrix is obtained.
The dimension of the original segmented data is transformed from 12 × l_seg × 76,800 to m_hidden × (l_seg − l_p + 1) × 76,800 by the convolutional layer. Generally, if the convolved data were sent directly into the softmax classifier, the input size would be so large that the model would easily overfit. Thus, a pooling layer is used after the convolutional layer to prune the data. The commonly used methods are max-pooling and mean-pooling; the basic idea is the same. In this paper, pooling is operated on each horizontal convolved vector, so the pooling window is 1 × s_p, where s_p is defined as the pooling size. The vector is divided into several disjoint segments, each of length s_p. Note that (l_seg − l_p + 1) does not have to be divisible by s_p, because the remainder elements can be discarded. For max-pooling, the maximum value of each s_p-length segment is chosen as the element of the new vector; for mean-pooling, the mean value is calculated instead. The final size of the pooled data is m_hidden × ⌊(l_seg − l_p + 1)/s_p⌋ × 76,800, where ⌊·⌋ denotes rounding towards minus infinity. Fig. 7 presents the data flow chart and the corresponding dimensions.
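The 1 × s_p pooling with remainder discarding can be sketched as below. The function name and the toy input are assumptions for illustration.

```python
import numpy as np

# Sketch of 1 x s_p pooling on each convolved row: the row is split into
# disjoint s_p-length pieces, any remainder is discarded (the floor operation
# in the text), and each piece is reduced by max or mean.
def pool_rows(conv, s_p, mode="mean"):
    m, l = conv.shape
    n_out = l // s_p                       # floor division; remainder dropped
    blocks = conv[:, :n_out * s_p].reshape(m, n_out, s_p)
    return blocks.max(axis=2) if mode == "max" else blocks.mean(axis=2)

conv = np.arange(40 * 372, dtype=float).reshape(40, 372)
pooled = pool_rows(conv, s_p=5)
print(pooled.shape)  # (40, 74)
```

For a convolved length of 372 and s_p = 5, the pooled length is ⌊372/5⌋ = 74, with the last two elements of each row discarded.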
The last layer of the framework is the softmax classifier. It is based on the softmax regression model, a generalisation of the logistic regression model that can deal with multi-class problems [24]. When softmax is applied to a q-class classification problem, the output vector of the classifier is made up of all q probabilities, and the input vector is assigned to the category with the highest probability. Given a group of labelled input vectors {x^(1), x^(2), …, x^(k)}, x^(i) ∈ R^n, with y^(i) the corresponding label, the probability that x^(i) belongs to a certain class j, j = 1, 2, …, q, can be calculated as [20]

P(y^(i) = j | x^(i)) = exp(θ_j^T x^(i)) / Σ_{l=1}^{q} exp(θ_l^T x^(i))    (5)

Since the input of the softmax layer has to be a one-dimensional vector, the pooled m_hidden by ⌊(l_seg − l_p + 1)/s_p⌋ matrix is stacked sequentially into a vector. The 76,800 records are divided into two parts, training data and validation data, in a ratio of 0.7 to 0.3.
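The softmax output can be sketched as follows. The parameter matrix theta and the dimensions are illustrative assumptions (16 classes; a stacked pooled vector of 40 × 74 = 2960 elements under the earlier example sizes).

```python
import numpy as np

# Sketch of the softmax output layer: for a q-class problem, the classifier
# maps the stacked pooled vector to q probabilities, and the predicted class
# is the one with the highest probability.
def softmax(scores):
    e = np.exp(scores - scores.max())      # shift for numerical stability
    return e / e.sum()

q, n = 16, 2960                            # 16 classes; 40 * 74 pooled inputs
rng = np.random.default_rng(4)
theta = 0.01 * rng.standard_normal((q, n))
x = rng.standard_normal(n)                 # the stacked pooled vector

p = softmax(theta @ x)
predicted = int(np.argmax(p))              # class with the highest probability
```

The probabilities in p sum to 1 by construction, matching the probability model in (5).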

Hyper-parameter selection of the proposed method
Although the basic framework has been presented in Section 3, some details should be determined step by step. The AE type and its activation function, sparsity, depth, number of hidden nodes and pooling strategy are discussed one by one.

Choosing the type of AE and its activation function
SAE and DAE are two common varieties of the original AE. In the SAE, the sparsity parameter ρ is defined to control the average activation level of all hidden nodes; for example, if ρ is set to 0.1, on average only 10% of the hidden nodes are activated. The DAE is generally used to make the model more robust to noise. In this paper, all the data are based on Simulink simulation and noise is not taken into consideration; thus, only the SAE and the original AE are adopted. The activation function enables a DL model to approximate a non-linear model instead of a simple linear one, which makes the model more adaptive. In the earlier training process, the sigmoid function was used as the default activation function, as it is one of the most widely used activation functions at present. However, several concerns arise from the application of the sigmoid, the most important of which is the vanishing gradient. ReLU is another function widely used in the DL realm [25]. It outputs x if x is positive and 0 otherwise, and this characteristic brings two advantages. For input x > 0, the derivative of the function is a constant, which does not cause vanishing gradients during backpropagation, and the computational load is much lower than that of the sigmoid. For input x < 0, the output is 0, so sparsity of the network is achieved directly.
Three types of AE with different activation functions are trained in this section: AE with sigmoid, AE with ReLU and SAE with sigmoid. The number of hidden nodes is set to 40, and the sparsity ρ is 0.1. The classification accuracies of the different types of AE are listed in Table 3, and the training process of each type is shown in Fig. 8.
From the classification results it can be concluded that the SAE with sigmoid performs best. Except for fault 4, all other 14 kinds of fault are identified with 100% accuracy. It should be mentioned that the no-fault data (fault 16) do not perform as well as the others. By checking the confusion matrix, it can be discovered that this kind of data is easily misclassified as faults 2, 6 and 10, which are the three kinds of low-voltage single winding fault, a-a, b-b and c-c. The learning process shown in Fig. 8 shows that the AE with ReLU is more efficient than that with sigmoid; however, at this size and dimension of training data, the difference among them is not significant.

Sparsity parameters
The sparsity ρ is the key value that determines the classification performance of the SAE. In this paper, ρ is set to 0.02, 0.04, 0.06, 0.08, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 and 0.7. Typically, ρ tends to be a small value [19], in line with the observation from bionics that only a small fraction of neurons is activated when the human brain receives an outside stimulus; hence the step of ρ is finer below 0.1. The classification results are illustrated in Fig. 9.
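The paper does not spell out how ρ enters the cost function; a common choice (an assumption here, based on the standard SAE formulation rather than the paper's code) is a KL-divergence penalty pushing each hidden node's average activation towards ρ:

```python
import numpy as np

# Sketch of the usual KL-divergence sparsity penalty for an SAE (assumed
# formulation, not quoted from the paper): rho_hat is the average activation
# of each hidden node over the training set, and the penalty is zero only
# when every node is exactly at the target rho.
def kl_sparsity_penalty(rho, rho_hat):
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

rho = 0.2
rho_hat_on_target = np.full(40, 0.2)   # all 40 hidden nodes at the target
rho_hat_too_active = np.full(40, 0.5)  # nodes far more active than rho

print(kl_sparsity_penalty(rho, rho_hat_on_target))   # 0.0
print(kl_sparsity_penalty(rho, rho_hat_too_active) > 0)  # True
```

Adding this penalty (scaled by a weight) to the reconstruction cost J_AE is what turns the plain AE into an SAE in this formulation.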
When the sparsity is too small, for example 0.02 or 0.04, the accuracy of the classifier decreases to 96-98%. As ρ increases to 0.2, the highest validation accuracy, 99.43%, is reached. However, further increasing ρ from 0.4 to 0.7 leads to a slight decline in accuracy. Thus, the sparsity is set to 0.2 in the later training process.

Depth of AE
In [26], Hinton mentions that a multiple-layer AE can make feature extraction more effective, and that it should be trained layer by layer. Supposing a 360-200-40 SAE is to be trained, the first 360-200 AE is trained first and the 200-40 SAE is trained right after it. According to (1), let W and b denote the overall weight matrix and bias vector of the stacked encoder; treating the composition of the two layers as linear, it can be derived that W = W_1″W_1′ and b = W_1″b_1′ + b_1″, where W_1′, b_1′ and W_1″, b_1″ are the parameters of the first and second layers, respectively. W and b are still a 40 by 360 matrix and a 40 by 1 vector, the same dimensions as those of a 360-40 SAE. To evaluate the influence of the depth of the AE, two multiple-layer SAEs, also called stacked SAEs (SSAEs), with different numbers of hidden nodes are implemented, and their results are compared with that of the single-layer SAE. All sparsity values are set to 0.2. Under the circumstances of this paper, a final structure of 360-40 over two layers is sufficient. Table 4 lists the comparison of different depths of SAE, and the optimisation process is shown in Fig. 10, where solid and dashed lines of the same colour indicate the two steps of training a double-layer SAE. The double-layer SAE improves the overall accuracy slightly, from 99.4 to 99.5%, at the cost of training complexity: it takes nearly three times as long to train a double-layer SAE, since the dimension of the training data matrix of the first layer is much larger than that of the second layer. From Fig. 10, it can be concluded that the optimisation efficiency of the single-layer SAE is better than that of the second layer and worse than that of the first layer. Thus, considering the trade-off between training time and improved performance, the single-layer SAE is the better choice.
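The dimension argument above can be checked numerically. This sketch treats the two-layer composition as linear (which is all the dimension claim relies on); the weight values are random placeholders.

```python
import numpy as np

# Dimensional check of the stacked-encoder claim: composing a 360-200 encoder
# with a 200-40 encoder (linearly) gives an overall weight matrix and bias of
# the same shapes as those of a single 360-40 encoder.
rng = np.random.default_rng(5)
W1, b1 = rng.standard_normal((200, 360)), rng.standard_normal((200, 1))  # layer 1
W2, b2 = rng.standard_normal((40, 200)), rng.standard_normal((40, 1))    # layer 2

W = W2 @ W1                # overall weight matrix: 40 x 360
b = W2 @ b1 + b2           # overall bias vector: 40 x 1
print(W.shape, b.shape)    # (40, 360) (40, 1)
```

Note that with a non-linear activation between the layers the stacked encoder is not equivalent to a single linear map; the check above only confirms that the overall parameter dimensions match the 360-40 SAE.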

Number of hidden nodes
The number of hidden nodes determines the number of features extracted by the SAE. More hidden nodes mean more information in the features, so it is foreseeable that more hidden nodes lead to better classification accuracy. The number of hidden nodes is varied from 10 to 200 and the accuracies are shown in Fig. 11.
When there are only ten hidden nodes in the SAE, the accuracy of the diagnosis system drops below 88%, because the information extracted by only ten features is not enough for this task. As the number goes up, the classifier becomes more and more accurate. There is a saturation effect of the accuracy as the number of features increases, because the information becomes redundant. It should be noted that the convolution calculation is a heavy computational burden, and the time cost increases linearly with the number of features, so it is desirable to reach a higher performance with fewer features. From Fig. 11, this number is 60 and the corresponding validation accuracy is 99.53%.

Pooling strategy
Both max-pooling and mean-pooling strategies are evaluated in this paper. Considering that the length of the convolved data is 172, the pooling length is set to 5, 10, 15, 20 and 25. Both training and validation accuracies are shown in Fig. 12.
When different pooling functions are used, max and mean in this case, the pooling size has a different impact on the accuracy. In general, classification accuracy decreases as the pooling size becomes larger, which can be attributed to a larger pooling size omitting more information from the original data. Under this condition, mean-pooling loses more information than max-pooling at large pooling sizes, yet performs better at small ones. The difference between training and validation accuracy is lowest when the pooling size is around 10-15. In this paper, mean-pooling with a size of 5 is adopted, because the trained model has the highest accuracy and the difference between training and validation accuracy, 0.07%, is acceptable.
After the parameter selection in Section 4, the basic structure of the model is a 360-40 SAE with the sigmoid function and a sparsity of 0.2, and the pooling strategy is mean-pooling with a size of 5.

Influence of time window
In the discussion above, the time window of the data is 20 ms, which corresponds to a whole cycle of the waveform and contains all the fault waveform information. It would be better if the model could be trained equally well with a smaller amount of data, relieving the computational load. Thus, redundancy is studied by narrowing the data window. The sampling rate is still fixed at 20 kS/s. The data window ratio is set between 10 and 90% in steps of 10%; when the ratio is 10%, the data length is 41, corresponding to a time window of 2 ms.

Verification dataset with different frequencies
In the cases studied above, the sampling rate of the data matrix is set to 20 kS/s. If this sampling rate changes, the features trained by the AE are altered. For example, with the patch time interval of 1.5 ms used above, the patch length is 30 at 20 kS/s; when the sampling rate decreases, the patch length drops accordingly.
Once the features are trained on 20 kS/s data, it is unknown whether the model needs to be re-trained when the input data are sampled at another frequency. In Fig. 14, two solutions are compared: interpolating the original AE features, or training a new AE. The original AE is trained on 20 kS/s data. When the frequency does not drop too much, e.g. to 15 or 17.5 kS/s, the classification accuracies of the interpolated AE and the newly trained AE are about the same. When the frequency decreases further, e.g. to 5, 8 or 10 kS/s, the accuracies diverge considerably. Fig. 14 also shows that when the frequency decreases to 8 kS/s, the overall accuracy of the newly trained model is 98.48%, which means that in the proposed fault diagnosis system, the sampling frequency of the input signal should be higher than 8 kS/s.
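The feature-interpolation option can be sketched as below. The resampling scheme (linear interpolation along the time axis via np.interp) is an assumed implementation, since the paper does not specify one.

```python
import numpy as np

# Sketch of feature interpolation across sampling rates (assumed scheme):
# a 12 x 30 feature learnt at 20 kS/s is resampled along the time axis so
# its length matches a lower sampling rate; e.g. 10 kS/s halves the 1.5 ms
# patch from 30 to 15 samples.
def resample_feature(feature, fs_old, fs_new):
    l_old = feature.shape[1]
    l_new = int(round(l_old * fs_new / fs_old))
    t_old = np.linspace(0.0, 1.0, l_old)
    t_new = np.linspace(0.0, 1.0, l_new)
    # linear interpolation of each channel row onto the new time grid
    return np.array([np.interp(t_new, t_old, row) for row in feature])

feature = np.random.default_rng(6).standard_normal((12, 30))  # 20 kS/s feature
print(resample_feature(feature, 20_000, 10_000).shape)  # (12, 15)
```

The resampled features can then be convolved with the lower-rate input data directly, avoiding a full re-training, at the cost of accuracy once the rate drops too far.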

Influence of signal channel selection
During the training and validation process above, the three-phase voltage and current waveforms of both sides are used. However, it is necessary to study the scenario in which only part of the channels is selected: fewer channels mean less storage usage and faster computation.

Selection of types and sides:
In Section 4, after the final parameter tuning, the classification accuracy on the validation dataset is 99.5%. Compared with the results in Table 5, it can be found that removing any channel from the original 12-channel signal leads to a decrease in accuracy.
More specifically, deleting the primary voltages has the least influence, giving 99.34%: if the faulted transformer is connected to an infinite power system, the voltage barely changes when the fault occurs. Deleting the secondary voltage has the same impact as deleting the secondary current, which can be attributed to the fact that, in this case, the current is highly correlated with the voltage because the secondary side of the transformer is connected to a linear load. When the input signal is further reduced to six channels, the fault identification accuracy is still 100%, but the no-fault data are more easily misclassified as fault data. When only three channels are selected, the accuracy is too low to be acceptable.

Selection of phases:
Based on the nine-channel signal comprising the primary current and the secondary voltage and current, shown in the second column of Table 5, the signals of some phases are removed in this section. The accuracies are listed in Table 6.
When two phases are selected, the overall accuracy drops from 99.5% to around 99% and the performance of single-phase-to-ground identification drops from 100 to 99%. However, when only a single-phase signal is chosen, the results are unacceptable.
Experiment on a real 10 kV transformer

Platform introduction
The experimental platform shown in Fig. 15 is a realisation of the simulated system in Fig. 1, except that no load is considered. The transformer on the right is a 10 kV, 10 kVA oil-immersed transformer stepping the voltage up from 380 V to 10 kV. The transformer on the left is a resin-insulated one that can easily be set to different fault conditions by shorting the high-voltage winding taps. Fig. 16 demonstrates the inner structure of a single-phase high-voltage winding. From Table 7 and Fig. 16, it can be seen that taps 1-6 divide the whole winding into six parts with different numbers of turns. Connecting taps 1 and 2 makes use of the whole winding, and the voltage of the primary side is 10.5 kV; similarly, if taps 5 and 6 are connected, the voltage is 9500 V. For convenience, A1 is defined as tap 1 of phase A.
Three Pintech HVP-15HF voltage probes are used to measure the primary voltages (channels 1-3); the ratio of the probes is 1000:1 and the maximum voltage is 20 kVpp. Waveforms are recorded by a LeCroy 3054 oscilloscope. Three Pintech 6010A voltage probes, with a ratio of 100:1, are applied on the secondary side, and their signals are input into a LeCroy 8254. The two oscilloscopes are triggered by the same signal, the voltage of phase A. Three current sensors made by Tsinghua University are chosen to monitor the currents. The highest frequency of the custom-made sensor is 1 MHz, which is sufficient for sampling the signal at 20 kHz. The data are transmitted to a pad via Bluetooth, and the time axes of the pad and the oscilloscopes are aligned to ensure the accuracy of the phase information.

Fault detection of different kinds of short-circuit cases
In normal operation, taps 1 and 2 are connected and the voltage is 10.5 kV. By manually shorting other taps, nine kinds of fault related to the high-voltage windings can easily be deployed. For example, to set an A-A fault, taps 4 and 6 are shorted. The number of turns between taps 4 and 6 is 82 which, considering safety, is the minimum that can be chosen. Under this circumstance, a 2.4% inter-turn fault is deployed, denoted A4-A6 in Table 8.
When a fault is deployed, there is a severe loop current in the shorted windings that would cause permanent damage to the winding; after several experiments, the duration of the short circuit was limited to less than 1 s. Two hundred 0.02 s data records are randomly selected for each kind of fault and re-sampled at 20 kHz. In this way, the input data are transformed into the same structure as the nine-channel data introduced in Table 5, 1VI & 2V (voltages and currents of the primary side and the secondary voltage). The testing dataset thus comprises 2000 records, as summarised in Table 8. It should be noted that the training and validation data are the same as those used in the fourth column of Table 5, and these 2000 records are merely used as a practical dataset to examine the robustness of the model trained on the simulated dataset.
Since the confusion matrix is too big and most of its elements are 0, it is simplified in the last column of Table 8 as misclassification information. For example, the misclassification information of A4-A6 is 1(4), 1(8) and 3(16), which means that one data record is misdiagnosed as fault 4, one as fault 8 and three as fault 16. From Table 8 it can be concluded that all accuracies on the fault testing data are higher than 96% and some are even close to 100%. The overall testing accuracy is 97.25%, which is 2% lower than the validation accuracy. Given that the signal-to-noise ratio of the testing data is about 25 dB and the three-phase voltage is slightly unbalanced, this 2% reduction in accuracy is acceptable.

Conclusion and future work
This paper proposed an inter-turn fault detection method for transformers based on a DL method that could be used for next-generation smart transformers. A 12-channel signal is combined sequentially from the voltage and current waveforms of the primary and secondary sides. Based on the simulation method proved by previous studies, over 76,000 data records are obtained and used as the training and validation dataset. Sixty features combined in a matrix are trained by an SAE with a sparsity of 0.2. A CSAE method is applied by convolving the 0.02 s data segments with the trained features. With the help of the pooling and softmax layers, different kinds of fault can be classified. The final recognition accuracy is 99.5%.
The influence of the signal characteristics on this model is studied. The minimum time window of the data segment is 12 ms and the lowest sampling frequency should be 8 kS/s. Removal of certain channels leads to a decrease in classification accuracy; a signal made up of 9 of the 12 channels, comprising the primary current and the secondary voltage and current, achieves an accuracy of 99.34%.
Nine kinds of faults are deployed one by one on an operating resin-insulated transformer, and 2000 data records are collected as the testing dataset. Due to the influence of slight operating noise and unbalanced voltage, the testing accuracy is 2% lower than the overall accuracy of the training and validation data.
The simulated dataset and preliminary experimental data used in this paper prove the feasibility of this classification algorithm. Future implementations of the method may use practical data collected from a faulted transformer operating in the power grid, under more complex operating conditions.