Ensemble deep learning for automated classification of power quality disturbances signals

The automatic classification of power quality disturbances (PQD) is of great significance for solving power quality problems. In this study, we propose an ensemble deep learning framework for the intelligent classification of PQ disturbances. Specifically, because disturbance signals are sequential in nature, a Long Short-Term Memory (LSTM) network is used to classify the signals. In addition, Bagging is used to integrate the training results of multiple LSTM networks to improve the generalization of the model. Our contribution lies in combining deep learning and ensemble learning to extract the classification representation of PQD signals. Because the power grid produces a large number of unlabeled power quality disturbance samples, an active learning strategy is adopted to select the most representative samples from the data set, which improves model performance with less labeled data. Finally, experiments were conducted in different noise environments. Compared with the existing multi-label learning models, this method achieves better classification performance at a good computation speed. Furthermore, the proposed active learning strategy can train the classification model with fewer labeled samples, reducing manual labeling costs.


Introduction
With the increasing application of solid-state switches, non-linear devices and various renewable generation sources connected to the smart grid, fluctuations in voltage, current and frequency in power systems are increasing. These disturbances pose great challenges for the safe and economic operation of power systems [1]. At the same time, as power consumption grows with economic development and the electrification of heat and transport, electricity consumers and electrical equipment place higher requirements on the quality of power supply. Therefore, accurate and effective classification of power quality disturbance (PQD) signals is of increasing significance for power quality (PQ) evaluation and management [2].
In the past few decades, researchers have explored the detection and classification of PQD signals in detail. Conventional disturbance classification methods consist of three steps: signal processing, feature selection and classifier design [3]. For signal analysis, many methods and improved algorithms have been proposed, including the short-time Fourier transform (STFT), S-transform (ST), wavelet transform (WT), variational mode decomposition (VMD), empirical mode decomposition (EMD), and Hilbert-Huang transform (HHT). These methods aim to improve the time-frequency resolution of the signals, but each has limitations. STFT cannot easily balance time resolution against frequency resolution, and it captures transient signal characteristics poorly [4]. WT-based methods are better suited to processing PQD signals, but their classification performance depends heavily on feature selection and classifier design [5]. EMD lacks a rigorous mathematical foundation and is prone to producing spurious components [6]. In addition, all of these methods are sensitive to noise and computationally heavy. Therefore, a signal processing algorithm with wider applicability and better noise immunity is needed.
Feature selection is a critical step for finding representative features in PQD signals. Extracting key features can significantly increase the classification speed of the classifier while maintaining accuracy. Various handcrafted features have been designed to represent PQD signals. However, the design of these features lacks a principled basis: it is still unclear why these features are chosen, how they are correlated, and whether they help distinguish different types of PQD signals. To address these problems, several feature optimization algorithms have been proposed. In [7], the particle swarm optimization (PSO) algorithm is used to optimize the parameters and select the optimal feature subset. In [8], a Probabilistic Neural Network based Artificial Bee Colony (PNN-ABC) is proposed to optimize feature selection. The non-optimality of manual feature selection and the time-consuming nature of the optimization process highlight the need for a more effective method that selects meaningful and distinct morphological features and constructs and optimizes the feature set automatically.
Besides the limitations in signal processing and feature selection as described above, the three steps of traditional PQD classification algorithms are carried out independently with a limited combination of signal processing methods and classifiers, limiting the classification accuracy. Therefore, the disturbance identification process should be optimized as a whole to simplify the analysis and improve the calculation efficiency while ensuring the accuracy.
Deep learning is an emerging and promising classification framework that can learn features and capture patterns from signals at multiple levels of abstraction (such as PQD signals) [9]. The Long Short-Term Memory (LSTM) network is a special recurrent neural network (RNN) in the field of deep learning. It has achieved strong performance in text classification, sentiment analysis, and time series prediction. Since PQD signals are by nature time series data, this paper proposes using the LSTM network for PQD classification. The main work and contributions of this paper are as follows: (1) Based on the characteristics of PQD signals, a new method for PQD classification based on ensemble learning and deep learning is proposed, which combines the three steps of traditional methods without an additional feature selection module, thus simplifying the classification process. (2) Based on the Bagging algorithm, we train multiple LSTM base classifiers and improve on the traditional majority voting strategy. A best-worst weighted voting strategy is proposed to further improve the generalization performance of the network. The performance of the proposed method is verified in comparative experiments. (3) To address the scarcity of labeled measured data, an active learning algorithm is adopted for sample selection to make more efficient use of unlabeled samples. By actively selecting the most informative samples, the annotation effort required by the deep neural network can be reduced.
The rest of this paper is organized as follows. The proposed Bagging-LSTM is presented in Section 2, in which the weighted voting strategy is embedded in the Bagging-LSTM. We then introduce the active learning strategy in Section 3. Comparative experiments are presented in Section 4. Finally, the conclusion of this paper is given in Section 5.

Regular LSTM
LSTM is a special recurrent neural network used in deep learning and an improvement on the basic RNN model. The key idea of the LSTM model is to introduce a memory unit for recurrent information transmission, which records the historical information up to the current moment. Therefore, compared with the short-term memory of a traditional RNN, LSTM has long-term memory capabilities: a gating mechanism (consisting of a forget gate, an input gate and an output gate) with values in (0, 1) controls the transmission path of the internal information in the model [10].
The structure of the recurrent unit is shown in Fig. 1, where $x_t$, $f_t$, $i_t$, $o_t$, $c_t$, $a_t$ and $h_t$ represent the input, forget gate, input gate, output gate, memory unit, candidate state, and output of the hidden layer at time $t$, respectively. $c_{t-1}$ and $h_{t-1}$ are the outputs of the memory unit and the hidden layer at time $t-1$. $\delta$ is the logistic sigmoid function, and tanh is the hyperbolic tangent activation function.
We denote the weights connecting the input to the forget gate, input gate, output gate, and memory unit as $W_{xf}$, $W_{xi}$, $W_{xo}$, and $W_{xc}$, respectively. Similarly, $W_{hf}$, $W_{hi}$, $W_{ho}$ and $W_{hc}$ are the weights between the hidden layer and the forget gate, input gate, output gate, and memory unit. $b_f$, $b_i$, $b_o$ and $b_c$ are the corresponding biases. $\odot$ denotes the element-wise product of vectors. The LSTM recurrent unit controls the flow of information through the degree to which the forget gate, input gate and output gate are open or closed. The specific process is given in Steps 1 to 6, written here in the standard LSTM form using the symbols defined above.
Step 1: The forget gate $f_t$ takes the input $x_t$ of the current layer and the output $h_{t-1}$ of the hidden layer at the previous moment as input. Its output is multiplied by $c_{t-1}$ to control how much of the internal state $c_{t-1}$ from the previous moment is forgotten. The forget gate is expressed as
\[ f_t = \delta\left(W_{xf} x_t + W_{hf} h_{t-1} + b_f\right) \]
Step 2: The input gate selectively saves the current input information, and its output $i_t$ determines how much information is updated. The input gate is obtained as
\[ i_t = \delta\left(W_{xi} x_t + W_{hi} h_{t-1} + b_i\right) \]
Step 3: The output gate $o_t$ controls how much of the internal state $c_t$ at the current moment is output to the external state $h_t$. The output gate is expressed as
\[ o_t = \delta\left(W_{xo} x_t + W_{ho} h_{t-1} + b_o\right) \]
Step 4: The output gate $o_t$ is multiplied by the state of the memory unit after it has passed through the tanh layer. The output $h_t$ of the hidden layer is calculated as
\[ h_t = o_t \odot \tanh(c_t) \]
Step 5: The memory unit $c_t$ records the historical information up to the current moment and is calculated by
\[ c_t = f_t \odot c_{t-1} + i_t \odot a_t \]
Step 6: The candidate state is calculated by
\[ a_t = \tanh\left(W_{xc} x_t + W_{hc} h_{t-1} + b_c\right) \]
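To make the data flow through the gates concrete, the following is a minimal pure-Python sketch of a single LSTM time step in the standard formulation. The function names (`sigmoid`, `matvec`, `gate`, `lstm_step`), the `params` dictionary layout and the bias on the candidate state are illustrative assumptions, not the implementation used in this paper.

```python
import math

def sigmoid(z):
    """Logistic sigmoid (the function written as delta in the text)."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gate(params, name, x_t, h_prev, act):
    """Compute act(W_x x_t + W_h h_prev + b) for the named gate."""
    W_x, W_h, b = params[name]
    pre = [p + q + r for p, q, r in zip(matvec(W_x, x_t), matvec(W_h, h_prev), b)]
    return [act(z) for z in pre]

def lstm_step(x_t, h_prev, c_prev, params):
    """One standard LSTM time step; returns (h_t, c_t)."""
    f_t = gate(params, "f", x_t, h_prev, sigmoid)    # Step 1: forget gate
    i_t = gate(params, "i", x_t, h_prev, sigmoid)    # Step 2: input gate
    o_t = gate(params, "o", x_t, h_prev, sigmoid)    # Step 3: output gate
    a_t = gate(params, "c", x_t, h_prev, math.tanh)  # Step 6: candidate state
    # Step 5: keep part of the old memory and add part of the candidate
    c_t = [f * c + i * a for f, c, i, a in zip(f_t, c_prev, i_t, a_t)]
    # Step 4: expose part of the squashed memory as the hidden output
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]

# Toy check with all-zero parameters (3 inputs, 2 hidden units):
# every gate equals 0.5 and the candidate state is 0.
n_in, n_hid = 3, 2
params = {k: (zeros(n_hid, n_in), zeros(n_hid, n_hid), [0.0] * n_hid) for k in "fioc"}
h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
print(c_t)  # half of the previous memory is retained
```

With zero weights the forget gate is exactly 0.5, so the memory unit simply halves at each step, which makes the role of Step 5 easy to verify by hand.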

Ensemble learning
L. K. Hansen first proposed the concept of ensemble learning, which improves the generalization performance of learners by constructing multiple neural networks [11]. The main ensemble learning schemes include Bagging, Boosting and Stacking. Bagging focuses on reducing the overall variance, while Boosting and Stacking focus on reducing model bias. When an artificial neural network or a deep learning method is used to classify PQD signals, the model is usually more likely to overfit than to underfit. Therefore, reducing the variance is the main consideration in improving the performance of the neural network model, and a Bagging-like ensemble model is used in this study [12].
The Bagging algorithm is shown in Fig. 2. First, T sample sets are obtained by repeating random sampling T times, and T independent weak learners (also known as base classifiers) are trained on them. Then, the final strong learner is obtained by applying a voting strategy to the T weak learners. For an ensemble model to be useful, the diversity of the base classifiers is a key factor. Instead of varying the network structure, different training datasets are fed into the LSTM to meet the diversity requirement. Bootstrapping is commonly used as the random sampling method. For an original training set of M samples, M samples are drawn at random with replacement to obtain a sample set containing M samples. In this way, the training dataset of each base classifier differs from the original training set and from the other sample sets, which ensures the diversity of the base classifiers in ensemble learning.
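The bootstrap sampling described above can be sketched in a few lines of Python. The function name `bootstrap_sets` and the toy data are assumptions made for illustration only.

```python
import random

def bootstrap_sets(samples, num_sets, seed=0):
    """Draw num_sets bootstrap sets, each with len(samples) items sampled with replacement."""
    rng = random.Random(seed)
    m = len(samples)
    return [[samples[rng.randrange(m)] for _ in range(m)] for _ in range(num_sets)]

original = list(range(1000))            # stand-in for M = 1000 training samples
sets = bootstrap_sets(original, num_sets=3)

# Fraction of distinct original samples that appear in each bootstrap set
unique_fraction = [len(set(s)) / len(original) for s in sets]
print(unique_fraction)
```

For large M, each bootstrap set is expected to contain roughly 1 - 1/e (about 63%) of the distinct original samples, so different base classifiers see noticeably different data even though every set has the same size.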

An improved bagging algorithm
In traditional Bagging algorithms, majority voting is usually used to make decisions, which gives all base classifiers the same decision-making power and ignores the differences in their performance [13]. This is the main limitation of majority voting, since the selected classifiers generally differ in competence. Therefore, weighted voting is used to aggregate the decisions of the selected classifiers. In this method, the output of each classifier is weighted by a coefficient that affects the combination process. The weighted majority vote is represented as follows:
\[ y(x) = \arg\max_{j} \sum_{i=1}^{L} w_i \, C_{ij}(x) \]
where $C = [C_1, C_2, \ldots, C_L]$ is a collection of $L$ classifiers; $x$ is the input; $C_{ij}$ is the output of the $i$-th classifier for the $j$-th category; and $w_i$ is the weight of the $i$-th classifier. In this study, the best-worst weighted voting method is adopted to quantify the weights [14]. The basic idea of this method is to identify the worst and the best members of the ensemble using their estimated errors on the validation set. The relative accuracy of the $i$-th individual learner, $a_i$, is expressed as
\[ a_i = 1 - \frac{e_i - e_b}{e_w - e_b} \]
where $e_w$ and $e_b$ are the maximum and minimum error rates of all individual learners (error rate = 1 - accuracy rate), and $e_i$ is the error rate of the $i$-th individual learner. The weight of each classifier is then obtained by normalizing the relative accuracies:
\[ w_i = \frac{a_i}{\sum_{k=1}^{L} a_k} \]
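As an illustration of how best-worst weighting can change a decision, the sketch below assumes the relative accuracy a_i = (e_w - e_i) / (e_w - e_b) and assumes that the weights are the relative accuracies normalized to sum to one. The function names and the toy error rates are hypothetical.

```python
def best_worst_weights(error_rates):
    """Map validation error rates to normalized best-worst weights (assumed form)."""
    e_b, e_w = min(error_rates), max(error_rates)
    if e_w == e_b:  # all learners equally good: fall back to equal weights
        return [1.0 / len(error_rates)] * len(error_rates)
    rel = [(e_w - e) / (e_w - e_b) for e in error_rates]  # best -> 1, worst -> 0
    total = sum(rel)
    return [a / total for a in rel]

def weighted_vote(class_outputs, weights):
    """class_outputs[i][j] is the output of classifier i for class j; return the winning class."""
    n_classes = len(class_outputs[0])
    scores = [sum(w * out[j] for w, out in zip(weights, class_outputs))
              for j in range(n_classes)]
    return max(range(n_classes), key=scores.__getitem__)

# Three base classifiers with validation error rates of 2 %, 5 % and 3 %
weights = best_worst_weights([0.02, 0.05, 0.03])

# One-hot decisions: the best classifier votes for class 0, the other two for class 1
outputs = [[1, 0, 0], [0, 1, 0], [0, 1, 0]]
weighted_choice = weighted_vote(outputs, weights)
majority_choice = weighted_vote(outputs, [1 / 3] * 3)  # equal weights = majority vote
print(weights, weighted_choice, majority_choice)
```

In this toy case the two weaker classifiers outvote the best one under majority voting, whereas the best-worst weights let the most accurate classifier prevail. Under this scheme the worst classifier receives zero weight, which is worth keeping in mind when the ensemble is small.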

Bagging-LSTM framework for classification of PQD signals
The network structure of the LSTM and the improved Bagging algorithm are introduced in the previous sections, based on which we propose a PQD signal classification framework. The framework of the proposed Bagging-LSTM is presented in Fig. 3.
The algorithm uses the LSTM network as the base classifier and integrates the results of multiple, mutually different LSTM networks through the Bagging algorithm to form a strong classifier. The procedure is as follows:
1) Following the ten-fold cross-validation principle, 10% of the training data of the PQ disturbance signals is split off as a validation set.
2) Bootstrapping: T training sample sets are obtained by random sampling with replacement from the training set.
3) For the t-th training sample set, an LSTM model is trained to obtain the t-th base classifier.
4) The accuracy of the t-th base classifier is evaluated on the validation set and used to calculate the weight of the t-th base classifier.
5) All base classifiers are combined with their corresponding weights, and the best-worst weighted voting strategy is used to predict the categories of the test set and calculate the accuracy.
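The five steps can be sketched end to end as follows. To keep the example self-contained and fast, a trivial nearest-class-mean learner on a one-dimensional feature stands in for the LSTM base classifier. That stand-in, the function names, the toy data and the normalized best-worst weights are all assumptions for illustration rather than the setup used in this paper.

```python
import random

def train_centroid(samples):
    """Stand-in base learner (NOT an LSTM): nearest class mean on a 1-D feature."""
    sums, counts = {}, {}
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    means = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))

def error_rate(model, samples):
    return sum(model(x) != y for x, y in samples) / len(samples)

def bagging_fit(data, num_models, train_fn, seed=0):
    rng = random.Random(seed)
    data = data[:]
    rng.shuffle(data)
    n_val = max(1, len(data) // 10)                 # step 1: hold out 10 % for validation
    val, train = data[:n_val], data[n_val:]
    models, errors = [], []
    for _ in range(num_models):
        boot = [rng.choice(train) for _ in train]   # step 2: bootstrap sample set
        model = train_fn(boot)                      # step 3: train one base classifier
        models.append(model)
        errors.append(error_rate(model, val))       # step 4: validation error
    e_b, e_w = min(errors), max(errors)
    if e_w == e_b:                                  # identical errors: equal weights
        weights = [1.0 / num_models] * num_models
    else:                                           # assumed normalized best-worst weights
        rel = [(e_w - e) / (e_w - e_b) for e in errors]
        weights = [a / sum(rel) for a in rel]
    return models, weights

def bagging_predict(models, weights, x):
    """Step 5: weighted vote over the labels predicted by the base classifiers."""
    scores = {}
    for model, w in zip(models, weights):
        label = model(x)
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

# Toy two-class data: class 0 centered at 0.0, class 1 centered at 5.0
rng = random.Random(1)
data = ([(rng.gauss(0.0, 1.0), 0) for _ in range(100)]
        + [(rng.gauss(5.0, 1.0), 1) for _ in range(100)])
models, weights = bagging_fit(data, num_models=3, train_fn=train_centroid)
print(bagging_predict(models, weights, -0.5), bagging_predict(models, weights, 5.5))
```

Because `train_fn` is passed in as a parameter, the same skeleton could wrap a function that trains an LSTM and returns its prediction function.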

Active learning
In existing PQD classification research, most of the sample data sets used to train the classifier are generated with MATLAB. Because the training samples are randomly generated, the resulting signals can be labeled easily. In practice, however, the detectors deployed in the power grid continuously collect signals on the transmission lines, and labeling the collected power quality events requires manual work by experts. This work is costly and time-consuming, and visual fatigue easily leads to misjudgment. Therefore, how to train the classification model with as few labeled samples as possible, so that the training time and the size of the model can be reduced, is a problem that needs further exploration.
Active learning interacts with experts by selecting the most valuable unlabeled data for them to label, so that the model can achieve better performance with less labeled data. The active learning algorithm is an iterative process, and its model consists of four main parts, as shown in Fig. 4: the target classification model, the labeled sample set, the unlabeled sample pool, and human experts [15]. We use the uncertainty sampling strategy, which is widely used in active learning, to select the unlabeled samples. In this strategy, the classifier assigns each unlabeled sample a confidence score that is used to evaluate its uncertainty. The confidence score is expressed as:
\[ s(x_i) = P(\hat{y} \mid x_i), \qquad \hat{y} = \arg\max_{y} P(y \mid x_i) \]
where $\hat{y}$ is the category with the highest probability predicted by the model for sample $x_i$. The confidence scores of the unlabeled samples are then sorted, and the samples with the lowest confidence, which are the most likely to be misclassified by the model, are selected for expert annotation. These newly labeled samples are added to the labeled sample set, and the updated labeled sample set is used to train the model again. In this way, the labeled sample set is updated and the model is trained repeatedly until its performance reaches a given standard, at which point the iteration stops. The active deep learning model is fine-tuned incrementally to speed up convergence.
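The query step of uncertainty sampling can be sketched as follows, taking the confidence score of a sample to be the largest predicted class probability. The function name `least_confident` and the toy probabilities are assumptions for illustration.

```python
def least_confident(probabilities, batch_size):
    """Return indices of the batch_size pool samples with the lowest confidence score."""
    confidence = [max(p) for p in probabilities]          # score = highest class probability
    ranked = sorted(range(len(confidence)), key=confidence.__getitem__)
    return ranked[:batch_size]

# Predicted class probabilities for four unlabeled pool samples (toy values)
pool_probs = [
    [0.98, 0.01, 0.01],   # confident prediction
    [0.40, 0.35, 0.25],   # uncertain prediction
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],   # most uncertain prediction
]
queried = least_confident(pool_probs, batch_size=2)
print(queried)  # indices of the samples to send to the expert
```

In a full loop, the queried samples would be labeled by the expert, moved from the unlabeled pool to the labeled set, and the model retrained before the next query.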
The sampling frequency is set to 6.4 kHz, and each sample contains 1280 data points. The labels are represented with one-hot encoding, such as [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0] (indicating that the sample belongs to the first type). In total, 12,000 samples have been generated. However, data collected in the power grid always contains noise. To enhance the noise immunity of the proposed method, Gaussian white noise of different levels is randomly added to the synthetic PQD data, with the signal-to-noise ratio (SNR) varying from 20 dB to 50 dB. Ten-fold cross-validation is then performed, with 10% of the training data split off as the validation set.
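As a minimal sketch of this data preparation, the following Python code generates one synthetic voltage sag, adds white Gaussian noise at a randomly chosen SNR between 20 dB and 50 dB, and builds a 15-element one-hot label. The 50 Hz fundamental, the sag depth and timing, and all function names are assumptions; the text specifies only the sampling frequency, the sample length, the label format and the SNR range.

```python
import math
import random

FS = 6400   # sampling frequency in Hz (from the text)
N = 1280    # points per sample (from the text)
F0 = 50.0   # fundamental frequency in Hz (assumption, not stated in the text)

def sag_signal(depth=0.5, start=0.06, stop=0.14):
    """Unit-amplitude sine whose amplitude drops by `depth` between start and stop (seconds)."""
    signal = []
    for n in range(N):
        t = n / FS
        amplitude = 1.0 - depth if start <= t < stop else 1.0
        signal.append(amplitude * math.sin(2 * math.pi * F0 * t))
    return signal

def add_awgn(signal, snr_db, rng):
    """Add white Gaussian noise so that the expected SNR equals snr_db."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]

def one_hot(index, num_classes=15):
    """One-hot label with 15 entries, matching the example label in the text."""
    return [1 if k == index else 0 for k in range(num_classes)]

rng = random.Random(0)
clean = sag_signal()
snr_db = rng.uniform(20, 50)      # SNR drawn from the range used in the text
noisy = add_awgn(clean, snr_db, rng)
label = one_hot(0)                # class index 0 is a hypothetical choice for the sag
```

With these assumed values, 1280 points at 6.4 kHz correspond to 0.2 s, that is, ten cycles of a 50 Hz waveform.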

Optimizing the bagging-LSTM neural network
We build the Bagging-LSTM neural network and set the model parameters as follows: the number of base classifiers is 3, the number of epochs is 100, the batch size is 64, and the number of LSTM hidden neurons is 64. The cross-entropy loss and the Adam optimizer are used to update the model parameters, with a learning rate of 0.001. The length of the PQD input is determined by the timestep. The network is trained on data without noise interference. The time consumed is the time the network takes to update the parameter weights once. Table 1 compares the performance of the network under different timesteps.
To explore the impact of the timestep, the accuracy, loss and training time are evaluated for different timestep values. The training time increases as the timestep value increases. However, the training accuracy and loss first increase and then decrease, which indicates that choosing an appropriate timestep is very important to the performance of the network. A timestep that is too short may not contain enough disturbance information, resulting in low accuracy, while a timestep that is too long covers the complete information but at low efficiency. Balancing classification accuracy against training time, we set the timestep to 10.
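One plausible way to present a 1280-point sample to an LSTM with a timestep of 10 is to split it into 10 consecutive segments of 128 points, so that each time step receives a 128-dimensional feature vector. This reshaping is an assumption made for illustration (the exact input layout is not given here), and the function name `to_sequence` is hypothetical.

```python
def to_sequence(signal, timesteps):
    """Split a 1-D signal into `timesteps` consecutive, equally long segments."""
    if len(signal) % timesteps != 0:
        raise ValueError("signal length must be divisible by the number of timesteps")
    width = len(signal) // timesteps
    return [signal[k * width:(k + 1) * width] for k in range(timesteps)]

sample = list(range(1280))           # stand-in for one 1280-point PQD sample
sequence = to_sequence(sample, timesteps=10)
print(len(sequence), len(sequence[0]))  # number of time steps, features per step
```

Under this layout, a larger timestep value means more recurrent steps per sample with a shorter segment at each step, which is consistent with the longer training times reported for larger timesteps.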
To verify the advantage of the proposed voting strategy, we compare the best-worst weighted voting strategy used in our study (Vote 1) with the traditional majority voting strategy (Vote 2). The base classifiers trained on different samples are named LSTM 1, 2 and 3, respectively. The comparison results of the two voting strategies are shown in Fig. 5. The results are averaged over 10 training sessions. It can be seen from Fig. 5 that the voting strategy proposed in this paper achieves higher accuracy than the traditional majority voting strategy. In addition, because the Bagging algorithm makes decisions through multiple classifiers, it obtains better generalization performance than a single classifier.

Comparing the bagging-LSTM with existing methods
The proposed Bagging-LSTM method is compared with four other deep learning methods, which are briefly introduced as follows: (1) Deep Convolutional Neural Network (DCNN): The deep convolutional neural network with six standard convolutional layers proposed in [5] is used as a reference method. The convolution kernel size is set to 3, the stride is 1, and the numbers of filters in the convolutional layers are set to 32, 32, 64, 64, 128 and 128, so the DCNN has a large number of parameters. (2) Convolutional Auto-encoder (CAE): The CAE proposed in [9] is used as another reference method. The down-sampling factors are set to 2, 8 and 2, and the parameter settings of the decoder are symmetrical to those of the encoder. For a fair comparison, the neural networks of the reference methods are trained and compared on the same computer and data set. The neural networks are trained for 100 iterations. As is clear from Table 2, DCNN can achieve higher classification accuracy, but its training time is longer. Compared with the Bagging-LSTM network, the GRU network has a slightly shorter training time because the separate storage unit is eliminated, and its accuracy is slightly lower. Table 2 indicates that the Bagging-LSTM proposed in this paper is a well-suited choice for PQD classification, with high classification accuracy and low training time cost.
To further compare the Bagging-LSTM network with traditional methods, Table 3 lists the performance of these methods for complex PQDs. The traditional methods require handcrafted features to be designed before classification, and the number of features used in these methods is relatively arbitrary. For example, [17] selected 19 features while [1] selected only 9 features. In contrast, the Bagging-LSTM automatically extracts valid features, which simplifies the classification process and makes PQD classification more integrated and automated.

Active learning verification
To verify the effectiveness of the active learning strategy, we compare the uncertainty-based active learning sampling method (US) with the random selection (RS) method. Both sampling methods are evaluated through ten-fold cross-validation, and the classifier is the Bagging-LSTM network. The algorithm queries 500 samples at a time, and each curve in Fig. 6 shows the average of ten random training runs. The vertical axes show two performance indicators commonly used in active learning: F1-score and loss. Fig. 6 shows that in the early stage of training, the performance of active learning sampling is close to that of random sampling. As the number of active queries increases, active learning yields better training results, showing its advantage. The F1-score after several queries is presented in Table 4. The active learning algorithm based on uncertainty sampling performs better than random sampling. When more than 3000 labeled samples are added, the F1-score of active learning reaches 0.9840, while random sampling needs nearly 5000 labeled samples to achieve similar performance. In this sense, the active learning algorithm can save about 40% of the manual labeling workload.
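For reference, the sketch below shows one common way to compute a multi-class F1-score (the macro average, which is an assumption, since the averaging mode is not specified here) together with the arithmetic behind the quoted labeling saving. The function name and the toy labels are hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)   # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)

# Toy example with three classes and one misclassified sample
f1 = macro_f1([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2])

# Labeling saving quoted in the text: about 3000 queried labels instead of about 5000
saving = (5000 - 3000) / 5000
print(round(f1, 4), saving)
```

In the toy example the per-class F1-scores are 1, 2/3 and 0.8, so a single error in a small class lowers the macro average noticeably.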

Testing using practical data
A set of practical signals is used to test the performance of the Bagging-LSTM network. The measurement data set is taken from the IEEE PES database for PQD classification [19,20]. The sampling rate of the signals is 256 points per cycle, and each signal has a length of 1536 points. A neural network model requires a large amount of data for training and optimization, and the quality of the data directly affects the performance of the model. The available data volume is not enough to train a good network, so we use data augmentation to increase the amount of training data. Data augmentation is an effective way to expand the number of data samples by flipping, random cropping, adding Gaussian noise, etc. Fig. 7 shows the result of data augmentation for the transient oscillation signal.
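The three augmentation operations mentioned above can be sketched for a one-dimensional signal as follows. Interpreting flipping as time reversal, the crop length of 1280 points, the noise level and all function names are assumptions for illustration; only the signal length (1536) and the sampling rate (256 points per cycle) come from the text.

```python
import math
import random

def flip(signal):
    """Reverse the signal in time (one possible reading of 'flipping')."""
    return signal[::-1]

def random_crop(signal, length, rng):
    """Cut a random contiguous window of the given length."""
    start = rng.randrange(len(signal) - length + 1)
    return signal[start:start + length]

def add_noise(signal, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return [s + rng.gauss(0.0, sigma) for s in signal]

def augment(signal, copies, crop_length, sigma, seed=0):
    """Create `copies` modified versions of one signal."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(copies):
        s = random_crop(signal, crop_length, rng)
        if rng.random() < 0.5:        # flip about half of the copies
            s = flip(s)
        augmented.append(add_noise(s, sigma, rng))
    return augmented

# Stand-in for one measured signal: 1536 points at 256 points per cycle
measured = [math.sin(2 * math.pi * k / 256) for k in range(1536)]
augmented = augment(measured, copies=4, crop_length=1280, sigma=0.01)
print(len(augmented), len(augmented[0]))
```

Whether a given transformation preserves the disturbance class depends on the disturbance type, so in practice the operations should be checked per class before they are applied.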
The proposed Bagging-LSTM model is trained on the augmented data set, and Table 5 compares the classification results. The average accuracy of Bagging-LSTM is 95.83%, which is lower than the simulation results presented in the previous sub-section. The main reason is that the data is still insufficient despite the data augmentation. In addition, the interference in real signals is more complicated, and the distribution across categories is uneven, which further decreases the accuracy. Although the classification result is not ideal, the Bagging-LSTM performs better than the other methods.

Conclusion
In this paper, a Bagging-LSTM network is proposed to automatically classify complex PQD signals by combining deep learning and ensemble learning. An improved Bagging strategy is used to integrate the training results of multiple LSTM networks, so as to improve classification accuracy and generalization. Furthermore, an active learning algorithm is used for sample selection. By actively selecting the most informative samples, the annotation requirement and the corresponding cost can be reduced. In addition, data augmentation is adopted to alleviate the shortage of labeled data by creating modified copies of the original samples. Compared with other deep learning methods, the proposed method has the advantage of less training time and higher precision. Future work will focus on developing smarter query strategies and effective model update strategies to achieve a better trade-off between classification accuracy and labeling cost.