Imbalanced Fault Classification of Bearing via Wasserstein Generative Adversarial Networks with Gradient Penalty

Recently, generative adversarial networks (GANs) have been widely applied to augment imbalanced input samples in fault diagnosis. However, existing GAN-based methods suffer from convergence difficulties and training instability, which reduce fault diagnosis efficiency. This paper develops a novel framework for imbalanced fault classification based on Wasserstein generative adversarial networks with gradient penalty (WGAN-GP), which interpolates randomly between the true and generated samples to ensure that the transition region between the true and false samples satisfies the Lipschitz constraint. The process of feature learning is visualized to show the feature extraction process of WGAN-GP. To verify the availability of the generated samples, a stacked autoencoder (SAE) is set to classify the enhanced dataset composed of the generated samples and original samples. Furthermore, the loss curves indicate that WGAN-GP has better convergence and faster training speed due to the introduction of the gradient penalty. Three bearing datasets are employed to verify the effectiveness of the developed framework, and the results show that the proposed framework has an excellent performance in mechanical fault diagnosis under the imbalanced training dataset.


Introduction
Machine fault diagnosis plays a significant role in ensuring the normal and orderly operation of industrial production.
There are usually two main methodologies for fault diagnosis: physics-based and data-driven methods. Deep learning [1][2][3], as the most popular data-driven method, has provided new ideas for machine fault diagnosis. Compared with traditional methods, the deep learning-based method has a powerful feature learning ability: it simulates the brain's learning process by constructing deep models, depicts the rich internal information of the data, and finally achieves fault recognition [4][5][6][7].
In the last few years, many studies have aimed to improve the accuracy of fault diagnosis through data-driven methods. Lei et al. [8] developed an unsupervised feature learning method based on sparse filtering to extract features directly from mechanical vibration signals and then classified the fault classes through softmax regression. Liu et al. [9] applied stacked autoencoders (SAEs) to diagnose gearbox faults, where the dropout technique and the ReLU function relieve the overfitting problem. Lei et al. [10] proposed an end-to-end long short-term memory (LSTM) model which could extract features from multivariate time series data directly. Zhang et al. [11] proposed an intercluster distance-optimized support vector machine (ICDSVM) which can classify not only different fault types but also fault severities. The above methods are proposed under the assumption of sufficient and balanced training data. In practice, however, mechanical equipment often runs in the normal state for a long time, so the monitoring data contain few samples of rare fault types. These methods therefore often perform poorly in the unbalanced fault diagnosis problem and may even fail to identify rare fault categories, so it is necessary to develop an effective fault diagnosis method for unbalanced samples. Much recent research in fault diagnosis has examined how to improve diagnosis performance by improving the classification method when there are not enough samples for training. Duan et al. [12] proposed a fault diagnosis framework based on support vector data description for multiclassification, in which the data description is applied to a binary tree structure from top to bottom for classification. Zhang et al. [13] proposed a method to classify unbalanced faults in permanent magnet synchronous motor drives, which analyzes the fault features by the discrete wavelet transform.
However, improving the classification method alone is not enough to obtain a good diagnostic effect as the degree of imbalance increases. This problem cannot be settled fundamentally unless we obtain more simulation data from the raw data. In this paper, the generative adversarial network (GAN) [14] is applied to increase the number of samples of the underrepresented classes. GAN has been widely applied because of its great prospects, in areas including image recognition, language processing, and information security [15][16][17]. To date, many GAN variants with different structures have been derived [18][19][20][21], and they are widely used in fault diagnosis [22][23][24]. GAN also offers a way to solve the problem of data imbalance in fault diagnosis by generating data for fault types with less data. Shao et al. [25] proposed a fault diagnosis framework based on the auxiliary classifier GAN (ACGAN) [26], which can generate realistic synthetic signals with different fault labels, and an assessment method was proposed to evaluate the quality of the generated signals. Guo et al. [27] proposed a framework based on multilabel 1D GAN (ML1D GAN). In the framework, ACGAN is applied to generate simulation damage data, and the generated and real data are both used to train the classifier. Nevertheless, the original GAN and most of its variants, including ACGAN, cannot theoretically judge the convergence and equilibrium points of the network with accuracy. The above methods are all structural improvements to GAN, but they still do not fundamentally solve the convergence difficulties and training instability problems which affect the efficiency of fault diagnosis.
Although GAN has many problems in training, the latest research on GAN shows that it still has broad prospects in fault diagnosis under unbalanced samples. In order to improve the accuracy and efficiency of machine fault diagnosis, we propose a novel deep learning framework for imbalanced fault diagnosis. First, the proposed method generates simulation samples through Wasserstein generative adversarial networks with gradient penalty (WGAN-GP) [28] for fault types with fewer data. It interpolates randomly between the true and generated samples to ensure that the transition region between the true and false samples satisfies the Lipschitz constraint [29]. Then, the high-level features are extracted from the enhanced dataset, and different fault types are classified through an SAE model. The rest of this paper is organized as follows. The algorithms of GAN, WGAN-GP, and SAE are briefly introduced in Section 2. In Section 3, the details of the proposed WGAN-GP-SAE framework are explained. Experimental verification and discussion of the results are given in Section 4. Finally, some conclusions are drawn in Section 5.

Generative Adversarial Networks.
The structure of GAN consists of a generator G and a discriminator D. As shown in Figure 1, the input of the generator is a random noise vector, and the generator G produces simulation signals intended to make the discriminator D unable to determine whether its input is real or generated. The output of D is a number between 0 and 1 which indicates the probability that the input is real data. G and D are optimized continually to improve their abilities to generate and to discriminate, respectively. The training objective function of GAN is

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_x}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))], \tag{1}$$

where $x$ is sampled from the raw data distribution $p_x$, $z$ is a stochastic noise vector that G accepts from the probability distribution $p_z$, and $p_g$ denotes the distribution of the generated data. The discriminator D is trained to maximize the probability of identifying the source of the input data. On the other hand, the purpose of G is to make the generated data distribution as close as possible to the real data distribution $p_x$. A dynamic equilibrium, i.e., Nash equilibrium [30], is achieved only if $p_x = p_g$.
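As a rough numerical illustration of the GAN objective from [14], the sketch below estimates the value function as the mean of log D(x) over real samples plus the mean of log(1 − D(G(z))) over generated samples. The function name `gan_value` and the constant discriminator outputs are hypothetical and chosen only to show the two extremes.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(D, G) from discriminator outputs.

    d_real: D(x) for real samples, values in (0, 1)
    d_fake: D(G(z)) for generated samples, values in (0, 1)
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A discriminator that cannot tell real from generated outputs 0.5 everywhere.
v_equilibrium = gan_value(np.full(8, 0.5), np.full(8, 0.5))

# A confident and correct discriminator pushes the value towards its maximum of 0.
v_confident = gan_value(np.full(8, 0.99), np.full(8, 0.01))
```

When the discriminator outputs 0.5 for every input, which is what happens at the equilibrium where the two distributions coincide, the estimate equals −2 log 2.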
The Jensen-Shannon divergence is applied to measure the difference between $p_x$ and $p_g$. The advantage of the original GAN is that it can, in theory, map random noise to samples that approximate the actual data without assuming a form for the data distribution. However, the Jensen-Shannon divergence is not a reasonable cost function if the network is trained to learn a distribution supported by low-dimensional manifolds.

Wasserstein Generative Adversarial Networks with Gradient Penalty.
In the original GAN, the gradient of the generator G becomes smaller when the discriminator D is trained well enough and larger when the discriminator D performs poorly. Wasserstein GAN (WGAN) uses the Wasserstein distance to evaluate the difference between the real and the generated sample distributions [20]. It has superior smoothing characteristics compared with the Jensen-Shannon divergence. The Wasserstein distance is given in equation (2):

$$W(p_x, p_g) = \inf_{\gamma \in \Pi(p_x, p_g)} \mathbb{E}_{(a, b) \sim \gamma}\left[\lVert a - b \rVert\right], \tag{2}$$

where $\Pi(p_x, p_g)$ is the set of all joint distributions $\gamma(a, b)$ whose marginals are $p_x$ and $p_g$, respectively, and $\lVert a - b \rVert$ is the distance from $a$ to $b$. The Wasserstein distance can be understood as the cost of an optimal transportation plan, and it can be written into a solvable form by mathematical transformation.
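The smoothness advantage can be seen in a toy case. The sketch below compares the two measures for a pair of one-dimensional point masses that do not overlap; the helper names and the 20-point grid are illustrative assumptions, not part of the original experiments.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q, spacing=1.0):
    """Wasserstein-1 distance for distributions on the same evenly spaced 1D grid."""
    cdf_gap = np.cumsum(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))
    return spacing * np.sum(np.abs(cdf_gap))

def point_mass(index, size=20):
    """All probability concentrated on one grid point."""
    d = np.zeros(size)
    d[index] = 1.0
    return d

real = point_mass(0)
for shift in (1, 5, 10):
    fake = point_mass(shift)
    print(shift, js_divergence(real, fake), wasserstein_1d(real, fake))
```

For non-overlapping supports the Jensen-Shannon divergence stays at log 2 however far apart the masses are, so it gives the generator no signal about direction, while the Wasserstein distance grows in proportion to the shift.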
When an approximately optimal discriminator is formed, the optimizer reduces the Wasserstein distance and effectively brings the distribution of generated samples close to that of real samples. The objective function of WGAN is

$$\min_G \max_{D \in R} \mathbb{E}_{x \sim p_x}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))], \tag{3}$$

where $R$ is the collection of 1-Lipschitz functions. The Lipschitz constraint is set to ensure that the gradient of the discriminator is not larger than a finite constant $K$ over the entire sample space. Furthermore, the Lipschitz limit is implemented through weight clipping so that the output value given by the discriminator does not change too drastically when the input sample fluctuates slightly. After updating the parameters of the discriminator, WGAN checks whether any parameter falls outside the compact interval $[-c, c]$. If so, the parameters are clipped back to the threshold range. In this way, the parameters of the discriminator D are always bounded, which ensures that the discriminator cannot give very different output values for two slightly different inputs.
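The clipping step itself is short. The following is a minimal sketch, assuming the discriminator parameters are held as a list of NumPy arrays; the threshold c = 0.01 and the random parameter values are assumptions for illustration, since the text does not state the value used.

```python
import numpy as np

def clip_weights(params, c=0.01):
    """Clip every discriminator parameter array into [-c, c] and return new arrays."""
    return [np.clip(p, -c, c) for p in params]

rng = np.random.default_rng(0)

# Hypothetical parameters: one weight matrix and one bias vector.
critic_params = [rng.normal(scale=0.05, size=(4, 3)),
                 rng.normal(scale=0.05, size=3)]

# Applied after each discriminator update.
clipped = clip_weights(critic_params, c=0.01)
```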
However, the weight clipping in WGAN leads to optimization difficulties, and the resulting critic can have a pathological value surface even when the optimization succeeds. The parameters tend to take boundary values in this method; that is, the discriminator tends to learn a simple mapping function, so the powerful fitting ability of the network is not fully exploited. Weight clipping can also easily lead to vanishing or exploding gradients. If the clipping threshold is set too small, the gradient becomes smaller as it passes through every layer of the network and may even vanish after multiple layers. On the other hand, if the clipping threshold is set a little too large, the gradient grows and gradient explosion occurs. Only when the threshold is set just right can the generator receive an appropriate back-propagated gradient. This equilibrium region is difficult to find in practice, which slows the convergence of WGAN and makes parameter tuning troublesome.
These problems are addressed by Wasserstein generative adversarial networks with gradient penalty (WGAN-GP). In fact, it is not necessary to impose a Lipschitz constraint on the entire sample space. Only the generated sample region, the real sample region, and the transition region sandwiched between them need to be restricted. To transfer the gradient efficiently, the gradient penalty directly constrains the gradient norm of the discriminator to stay around 1, which increases the controllability of the gradient.
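The loss function of the generator remains unchanged, and the loss function of the discriminator is

$$L_D = \mathbb{E}_{z \sim p_z}[D(G(z))] - \mathbb{E}_{x \sim p_x}[D(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim p_{\hat{x}}}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\right)^2\right]. \tag{4}$$

The last term is the newly added gradient penalty, $\lambda$ is the penalty coefficient, and $\hat{x}$ is sampled along straight lines between pairs of points drawn from $p_x$ and $p_g$. The introduction of the gradient penalty results in more stable gradients that neither vanish nor explode, so more complicated networks can be trained.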
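A minimal sketch of the penalty computation follows. To stay self-contained it uses a hypothetical one-hidden-layer tanh critic whose input gradient is available in closed form; a real implementation would obtain this gradient by automatic differentiation. The coefficient `lam=10.0` is an assumed value, since the text does not report one.

```python
import numpy as np

def critic(x, W, b, v):
    """One-hidden-layer critic D(x) = v . tanh(W x + b), applied row-wise to a batch."""
    return np.tanh(x @ W.T + b) @ v

def critic_input_gradient(x, W, b, v):
    """Analytic gradient of D with respect to each input row of x."""
    h = np.tanh(x @ W.T + b)          # shape (batch, hidden)
    return ((1.0 - h ** 2) * v) @ W   # shape (batch, input_dim)

def gradient_penalty(real, fake, W, b, v, lam=10.0, rng=None):
    """lam * mean((||grad D(x_hat)||_2 - 1)^2) on random real/generated interpolates."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.uniform(size=(real.shape[0], 1))   # one mixing weight per sample pair
    x_hat = eps * real + (1.0 - eps) * fake
    grad_norm = np.linalg.norm(critic_input_gradient(x_hat, W, b, v), axis=1)
    return lam * np.mean((grad_norm - 1.0) ** 2)

def critic_loss(real, fake, W, b, v, lam=10.0, rng=None):
    """Discriminator loss: mean D(generated) - mean D(real) + gradient penalty."""
    wasserstein_term = critic(fake, W, b, v).mean() - critic(real, W, b, v).mean()
    return wasserstein_term + gradient_penalty(real, fake, W, b, v, lam, rng)

rng = np.random.default_rng(0)
W, b, v = rng.normal(size=(16, 8)), np.zeros(16), rng.normal(size=16)
real = rng.normal(size=(32, 8))
fake = rng.normal(loc=2.0, size=(32, 8))
loss = critic_loss(real, fake, W, b, v, rng=rng)
```

Because the penalty is evaluated only at the interpolated points, the constraint is enforced on the region between the real and generated samples rather than on the whole sample space.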

Stacked Autoencoders.
An autoencoder is an unsupervised learning algorithm trained with backpropagation. The hidden layer $h$ inside the autoencoder generates a code that represents the input. The structure consists of two parts: an encoder and a decoder. The encoder maps the input data into a hidden representation, while the decoder reconstructs the input data from it. An unlabeled input dataset $\{x_n\}_{n=1}^{N}$ is given, where $x_n \in \mathbb{R}^{m \times 1}$, $h_n$ is the hidden encoder vector calculated from $x_n$, and $\hat{x}_n$ denotes the decoder vector of the output layer. The encoding process is

$$h_n = f(W x_n + b_1), \tag{5}$$

where $f(\cdot)$ denotes the encoding function, $W$ is the weight matrix of the encoder, and $b_1$ is the bias vector. The decoding process is

$$\hat{x}_n = g(W^T h_n + b_2), \tag{6}$$

where $g(\cdot)$ denotes the decoding function, $W^T$ denotes the weight matrix of the decoder, and $b_2$ denotes the bias vector.
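The encoder and decoder mappings, with the decoder reusing the transposed encoder weights, can be sketched as follows. The sigmoid activation for both f and g, the layer sizes, and the random inputs are assumptions made only for illustration; the squared reconstruction error computed at the end is the quantity that autoencoder training minimizes.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x, W, b1):
    """h = f(W x + b1), applied row-wise to a batch x of shape (N, m)."""
    return sigmoid(x @ W.T + b1)

def decode(h, W, b2):
    """x_hat = g(W^T h + b2), reusing the transposed encoder weights."""
    return sigmoid(h @ W + b2)

def reconstruction_loss(x, W, b1, b2):
    """Mean over samples of the squared reconstruction error ||x - x_hat||^2."""
    x_hat = decode(encode(x, W, b1), W, b2)
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))

rng = np.random.default_rng(0)
m, hidden, N = 12, 4, 6                      # illustrative sizes only
x = rng.uniform(size=(N, m))
W = rng.normal(scale=0.1, size=(hidden, m))  # shared by encoder and decoder
b1, b2 = np.zeros(hidden), np.zeros(m)

h = encode(x, W, b1)
loss = reconstruction_loss(x, W, b1, b2)
```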
To minimize the reconstruction error, the parameters of the autoencoder are optimized as follows:

$$\min_{W, b_1, b_2} \frac{1}{N} \sum_{n=1}^{N} L(x_n, \hat{x}_n), \tag{7}$$

where $L$ denotes the loss function: $L(x, \hat{x}) = \lVert x - \hat{x} \rVert^2$. The autoencoders are stacked layer by layer to form a deep neural network; that is, each hidden layer is taken as the input of the next layer, and this proceeds progressively until the training process is completed. The introduction of batch normalization [31] establishes a simple initial condition for the training of SAE, which lets the gradient update start from a very shallow path and thus speeds up the training process. The labeled signals are fed to a softmax classifier, and the backpropagation (BP) algorithm is applied to update the network weights and fine-tune the parameters [32].
Shock and Vibration

System Framework and Model Training
This part describes the developed WGAN-GP-SAE framework for bearing fault diagnosis. The structure of the models is discussed in detail below.

System Framework Design.
The generator G contains three layers with 200, 600, and 1200 neurons, respectively. The discriminator D has four layers with 1200, 600, 200, and 1 neurons, respectively. The dimension of the random vector is 100, and ReLU is selected as the activation function. The SAE has five layers, each of which acts as an encoder. The features extracted by one layer are used as the input of the next layer, and the process is repeated a specified number of times until the iteration is completed. Then, the backpropagation (BP) algorithm is applied to minimize the error between the learned features and the labels, so that the weights of the SAE are updated and the parameters are fine-tuned. The system framework design is shown in Figure 2. In the data generation module, the learning rates of the generator and the discriminator are both 1E-4, the optimizer of WGAN-GP is root mean square propagation (RMSProp), and the number of iterations is 1000. The SAE is trained to output classification results using the dataset composed of the generated samples and the original samples.
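The layer widths above can be checked with a small shape-only sketch. It reads the first discriminator entry (1200) as the input width, uses a linear output layer, and initializes weights with a small random scale; all three choices are assumptions, as the text does not specify them, and no training is performed here.

```python
import numpy as np

NOISE_DIM = 100
GENERATOR_LAYERS = [200, 600, 1200]          # widths stated in the text
DISCRIMINATOR_LAYERS = [1200, 600, 200, 1]   # first entry read as the input width

def init_mlp(sizes, rng):
    """Random weights and zero biases for consecutive layer widths."""
    return [(rng.normal(scale=0.02, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    """ReLU on hidden layers; the last layer is left linear."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

rng = np.random.default_rng(0)
generator = init_mlp([NOISE_DIM] + GENERATOR_LAYERS, rng)
discriminator = init_mlp(DISCRIMINATOR_LAYERS, rng)

z = rng.normal(size=(8, NOISE_DIM))          # batch of 8 noise vectors
fake_spectra = forward(z, generator)         # one 1200-point spectrum per noise vector
scores = forward(fake_spectra, discriminator)
```

The generator output width of 1200 matches the 1200 Fourier coefficients used as the sample representation in the experiments.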

Model Training Procedure.
The model training procedure can be described in three steps.
First, a set of noise vectors with a dimension of 100 is input into the generator. The generator G is trained according to the distribution of the real signal spectrum and generates a high-dimensional simulation signal.
Second, real data and simulation data are input into the discriminator D, which outputs a probability value for evaluating the authenticity of the input data.
Finally, the simulation samples generated through WGAN-GP are combined with the raw samples to enhance and balance the training dataset, which is then fed into the SAE model for fault type diagnosis.

Case 1: Case Western Reserve University Dataset.
This section uses the bearing fault data of Case Western Reserve University (CWRU) to validate the proposed fault diagnosis framework.
The sampling frequency is 48 kHz. The bearing is damaged by single-point electrical discharge machining (EDM), and the data are divided into four health conditions: normal, inner ring fault, outer ring fault, and rolling body fault. There are three damage sizes for each type of fault, 0.18 mm, 0.36 mm, and 0.54 mm, giving a total of 10 bearing health conditions. They are named NC, IF1, IF2, IF3, OF1, OF2, OF3, RF1, RF2, and RF3. Each health condition consists of 500 samples, and every sample includes 2400 data points. We perform the fast Fourier transform (FFT) on each sample to obtain 1200 Fourier coefficients.
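The preprocessing step can be sketched as below. Taking the magnitudes of the first 1200 bins of a real-input FFT is one plausible reading of "1200 Fourier coefficients" and is an assumption here, as is the synthetic 1 kHz test tone; the helper name `to_spectrum` is hypothetical.

```python
import numpy as np

def to_spectrum(sample):
    """Map a 2400-point vibration segment to 1200 Fourier magnitude coefficients."""
    sample = np.asarray(sample, dtype=float)
    coeffs = np.fft.rfft(sample)              # 2400 real points give 1201 complex bins
    return np.abs(coeffs[:sample.size // 2])  # keep the first 1200 magnitudes

fs = 48_000                                   # sampling frequency from the text, in Hz
t = np.arange(2400) / fs
segment = np.sin(2 * np.pi * 1000 * t)        # synthetic 1 kHz tone, illustrative only

spectrum = to_spectrum(segment)
peak_bin = int(np.argmax(spectrum))
```

With 2400 points at 48 kHz each bin spans 20 Hz, so the 1 kHz tone peaks at bin 50.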
Two datasets with different degrees of imbalance are constructed to verify the proposed method, as listed in Table 1. In Dataset A, NC samples account for 50%, fault samples with damage sizes of 0.18 mm and 0.36 mm account for 40% and 30%, respectively, and fault samples with a damage size of 0.54 mm account for 20%. To facilitate comparative studies, the percentage of test samples is kept at 50%. To represent a more severe imbalance, the number of training samples is reduced further in Dataset B. In order to allow meaningful comparisons, GAN with SAE (GAN-SAE) and WGAN with SAE (WGAN-SAE) are also employed for diagnosis, and the composition of the two comparative frameworks is the same as that of the proposed method.
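To make the balancing target concrete, the sketch below works out how many generated samples each class of Dataset A would need, assuming every class is topped up to the 50% share already held by NC. The dictionary layout and the function name are illustrative choices.

```python
SAMPLES_PER_CONDITION = 500

# Training fractions for Dataset A as stated in the text.
TRAIN_FRACTION_A = {
    "NC": 0.50,
    "IF1": 0.40, "OF1": 0.40, "RF1": 0.40,   # 0.18 mm
    "IF2": 0.30, "OF2": 0.30, "RF2": 0.30,   # 0.36 mm
    "IF3": 0.20, "OF3": 0.20, "RF3": 0.20,   # 0.54 mm
}

def samples_to_generate(train_fraction, target_fraction=0.50,
                        total=SAMPLES_PER_CONDITION):
    """Number of synthetic samples each class needs to reach the target share."""
    target = round(target_fraction * total)
    return {name: target - round(frac * total)
            for name, frac in train_fraction.items()}

needed_a = samples_to_generate(TRAIN_FRACTION_A)
```

Under this assumption NC needs no generated samples, while the 0.18 mm, 0.36 mm, and 0.54 mm fault classes need 50, 100, and 150 each.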
WGAN-GP is applied to simulate the bearing spectrum signals of the ten health states in turn. We combined the generated signals with the original dataset to raise the percentage of every type of sample to 50%. The comparison between the real signal and the generated signal is depicted in Figure 3. It can be seen that the simulation signals have almost the same spectral characteristics as the real signal, which indicates that WGAN-GP has a powerful feature extraction ability. To further illustrate the feature extraction ability of WGAN-GP, we study the feature learning process of the network. The numbers of units in the layers are 200, 600, and 1200. In order to better illustrate the feature extraction of each hidden layer, we selected the extraction process of the OF3 fault samples and plotted all the extracted eigenvectors of each layer together in three-dimensional space. The result is shown in Figure 4. It is clear that the amplitude of the eigenvector increases with the depth of the layer, and the main characteristics also become more distinct as the number of layers increases. As shown in Figure 4(d), the frequency spectra of the generated samples follow the same trend as those of the raw samples. Furthermore, it is worth noting that the generated signal retains the characteristics of the original signal and is much sparser, which may be beneficial to further fault classification. In other words, all the major characteristics have been extracted from the raw signals, and some noise elements that are not conducive to pattern recognition are removed by WGAN-GP.
In order to eliminate the influence of accidental factors, we conducted each experiment 20 times. The test accuracy of the three methods for Datasets A and B is shown in Figure 5. For Dataset A, the average accuracy of GAN-SAE, WGAN-SAE, and WGAN-GP-SAE is 85.05%, 96.75%, and 99.90%, respectively, and the standard deviation is 0.82%, 0.48%, and 0.10%, respectively. For Dataset B with a higher degree of imbalance, the average accuracy of GAN-SAE decreases to 76.31% with a standard deviation of 1.51%, and the average accuracy of WGAN-SAE decreases to 91.56% with a standard deviation of 0.63%. By contrast, WGAN-GP-SAE achieves an average accuracy of 96.97% with a standard deviation of 0.24%. Therefore, the proposed method clearly outperforms the other methods in unbalanced fault classification, exceeding WGAN-SAE by 3.15 and 5.41 percentage points on Datasets A and B, respectively. When the degree of imbalance in the training data is not serious, the method proposed in this paper can improve the accuracy of fault diagnosis to a certain extent. When the degree of imbalance is aggravated, there is less information available for signal types with less data in the original signal, and the quality of the generated samples will be affected. In this case, the accuracy can be greatly improved through our method. To visualize the fault classification of the three methods, the t-distributed stochastic neighbor embedding (t-SNE) [33] technique is applied to map the high-dimensional features into three-dimensional (3D) space. The dimensionality reduction results for Dataset A are shown in Figure 6. It can be seen that the GAN-SAE framework can only accurately identify the NC, IF1, OF1, and RF3 samples, and there are different levels of confusion between the other fault samples; as shown in Figure 6(b), WGAN-SAE has a slight degree of confusion in classification.
In Figure 6(c), WGAN-GP-SAE separates all bearing health type samples, and samples of the same type are tightly clustered, which demonstrates the superiority of the proposed method. When the degree of imbalance increases, as shown in Figure 7, GAN-SAE classifies most fault types poorly, and the classification effect of WGAN-SAE also becomes worse. In contrast, for WGAN-GP-SAE, only several samples are misclassified in Figure 7(c), which is much better than the other two methods. Therefore, we can conclude that the developed framework has stronger feature extraction and feature classification capabilities than the other two methods.

Case 2: Shandong University of Science and Technology Dataset.
The experimental equipment for rotating machinery at the Shandong University of Science and Technology (SDUST) is shown in Figure 8. Vibration acceleration signals come from a specially designed bench which consists of a gearbox, a motor, three shaft couplings, two rotors, two bearing seats, and a brake. The types of bearing failures measured by our laboratory are the same as those of the CWRU. However, our sampling frequency is 12.8 kHz, and the bearing failure damage sizes are 0.2 mm, 0.4 mm, and 0.6 mm. Considering the rotation frequency of the shaft as 1500 rpm, each period of rotation contains 480 data points. To avoid the influence of speed fluctuation, each sample collects five periods of rotation data (2400 data points). Each condition includes 500 samples, and there are 4000 samples in the dataset of our laboratory. The percentage of samples with the same damage size is equal in Datasets A and B. To explore the diagnostic capabilities of our framework when the factor that causes the imbalance changes, we established another unbalanced sample dataset (Dataset C) based on the data of SDUST. As shown in Table 2, the percentages of IF, OF, and RF are 30%, 20%, and 10% in the training dataset, respectively. The frequency spectra of raw samples and simulation samples are shown in Figure 9, and it is clear that the trend of the simulation signals is highly similar to that of the raw signals.

Table 1: Number of training and testing samples for each health condition in Datasets A and B.

Health condition   Training (Dataset A)   Training (Dataset B)   Testing
NC                 250                    250                    250
IF1                200                    100                    250
IF2                150                    50                     250
IF3                100                    25                     250
OF1                200                    100                    250
OF2                150                    50                     250
OF3                100                    25                     250
RF1                200                    100                    250
RF2                150                    50                     250
RF3                100                    25                     250
To explore the feature extraction ability of WGAN-GP, the OF3 samples are selected to compare the generated samples with the original samples in a 3D diagram. As shown in Figure 10, the simulation signals generated by WGAN-GP have almost the same spectral characteristics as the original samples. We tested each method 20 times, and the results are shown in Figure 11. The average accuracy of GAN-SAE, WGAN-SAE, and WGAN-GP-SAE is 79.44%, 92.12%, and 98.15%, and the standard deviation of the three methods is 1.05%, 0.41%, and 0.24%, respectively. Furthermore, we applied t-SNE to visualize the high-dimensional features learned by the three frameworks. The dimensionality reduction results are shown in Figure 12. In Figure 12(a), the samples of IF3 and OF3 overlap seriously, and other types of samples are mixed up to varying degrees. In Figure 12(b), there are still some samples that do not cluster well. For the proposed framework, as shown in Figure 12(c), almost all the samples with the same health status are gathered closely, and different types of samples are separated completely. Therefore, we can conclude that the developed model can be used for fault diagnosis efficiently even when the factor causing the imbalance changes.
To evaluate the availability of generated samples and explore the influence that the quantity of simulation samples has on diagnosis accuracy, we set up several scenarios with different numbers of generated samples to test their performance. It is important to note that, within each scenario, every type has the same number of training samples, and the number of test samples was still 250 for each type. All the scenarios are still tested by SAE. The detailed training data settings of each scenario and the corresponding classification accuracy are listed in Table 3. From scenarios A and C, it can be seen that when only the generated samples are used for classification, the accuracy is not far from that obtained when only the same number of original samples is used. From scenarios B, C, D, E, and F, it can be seen that if there are enough generated samples in the training dataset, we can obtain high accuracy just by using the generated samples. From scenarios G, H, and I, we can see that when there are a certain number of real samples in the training data, the supplement of high-quality simulation samples can make up for the impact caused by the absence of real samples.
Due to the introduction of the gradient penalty, WGAN-GP theoretically has a faster convergence speed than WGAN. To verify the convergence property of WGAN-GP, we visualized the loss values of the discriminator D and the generator G in the feature learning process. The loss curves of the discriminator and the generator in WGAN and WGAN-GP are plotted in Figure 13. As shown in Figure 13(a), the loss value of the discriminator in WGAN converges to 0 at 500 iterations, and the generator loss converges at about 650 iterations, so the two losses do not converge at the same time. In Figure 13

Discussion.
The WGAN-GP-SAE framework is proposed for the classification of mechanical unbalanced faults in this paper. WGAN-GP is employed to generate simulation samples to enhance the raw samples, and an SAE model is set for fault classification. The proposed method was tested on the rolling bearing data of CWRU and further demonstrated on the data measured in our laboratory. The results indicate that the proposed framework can effectively solve the problem of fault diagnosis under sample imbalance.
The original GAN and many of its derivative models have difficulty converging. We studied the loss value curves in the process of data generation. It can be seen that WGAN-GP achieves simultaneous convergence of the generator and the discriminator within 400 iterations, which is a much better convergence performance than that of WGAN. Therefore, the introduction of the gradient penalty improves the performance of the proposed model.
Despite the good performance in fault diagnosis, the original parameter settings of WGAN-GP-SAE will no longer apply if the classification task changes. Furthermore, it is difficult to explain how the weights of each layer are affected when the parameters change during the training process. A possible remedy is to set some initial conditions for the whole network so that the parameters can be selected adaptively, or to introduce the weight agnostic neural network (WANN) to build a shared weight for the network that can update automatically [34].

Conclusion
In this paper, an imbalanced fault classification model, WGAN-GP-SAE, is proposed. In this framework, WGAN-GP is applied to augment the fault type signals with fewer samples so that all fault type samples reach data balance. Fault samples consisting of original samples and generated samples are classified by SAE. The effectiveness of the proposed method is verified on three datasets with different degrees of imbalance. We visualized both the features learned by each hidden layer of WGAN-GP and the high-dimensional features extracted by SAE to show the powerful feature extraction and classification capabilities of the proposed framework more intuitively.
Supplementary experiments show that good accuracy can be obtained by using only the generated samples, which indicates that the simulation samples generated by WGAN-GP are effective and reliable. Moreover, we examined the loss values of WGAN-GP in the process of data generation, which show that the introduction of the gradient penalty gives the whole network a superior convergence performance. Therefore, WGAN-GP-SAE is highly effective in handling sample imbalance in mechanical fault diagnosis.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.