Wasserstein Generative Adversarial Network and Convolutional Neural Network (WG-CNN) for Bearing Fault Diagnosis

In recent years, intelligent fault diagnosis technology based on deep learning algorithms has been widely used in industry and has achieved gratifying results. Most of these methods require a large amount of training data. However, in actual industrial systems, it is difficult to obtain sufficient and balanced sample data, which poses challenges for fault identification and classification. To solve these problems, this paper proposes a data generation strategy based on a Wasserstein generative adversarial network and a convolutional neural network (WG-CNN), which uses a generator and a discriminator to conduct adversarial training, expands a small sample set into a high-quality dataset, and uses a one-dimensional convolutional neural network (1D-CNN) to learn sample characteristics and classify different fault types. Experimental results on the standard Case Western Reserve University (CWRU) bearing fault diagnosis benchmark dataset show that the proposed method achieves 100% classification accuracy for few-shot learning. The method also performs well in different noise environments.


Introduction
In recent years, with the rapid development of high-performance computing, big data, and deep learning technologies, intelligent fault diagnosis and pattern recognition technology based on deep learning has attracted wide attention from scholars in the field because it does not rely on subjective human analysis and reasoning. There are many research studies on fault diagnosis based on open source data, which can better reflect the comparative and dynamic nature of the research. There are basically four major open source bearing fault datasets in the world: the Case Western Reserve University (CWRU) datasets, the Paderborn University bearing datasets, the PRONOSTIA bearing dataset, and the Intelligent Maintenance Systems (IMS) datasets. The earliest applications of artificial neural network (ANN) algorithms to motor faults can be traced back to [1,2], and ANN was used to make a comprehensive … The studies in [3][4][5][6] all used a certain degree of human experience and knowledge in fault feature selection to train ANN more effectively. Reference [7] was one of the earliest works on bearing fault diagnosis using principal component analysis (PCA). In addition, classic papers based on PCA [8,9] made use of its data mining capabilities to support "manual" feature selection and extract more representative fault features. For bearing fault classification, the results obtained by SVM in [10] were the best in all cases and comprehensively improved on the performance of ANN. Other papers [11][12][13] used KNN to conduct distance analysis on a new data sample and determine whether it belonged to a specific fault category.
In addition to the aforementioned common ML methods, many other algorithms have also been applied to the identification of bearing faults, including Bayesian networks [14], ELM [15], transfer learning [16], random forest [17], independent component analysis [18], manifold learning [19], typical variable analysis [20], expectation maximization [21], set learning [22], empirical mode decomposition [23], and dictionary learning [24].
However, the industrial environment is more complex; some bearing fault characteristics are difficult to extract or explain manually because of their high dimensionality.
These "weak" classical machine learning methods based on manually selected features sometimes give inaccurate classification results. Therefore, many deep learning algorithms with automatic feature extraction capability and better classification performance have been successfully applied to bearing fault diagnosis and have also achieved gratifying results. The first paper using CNN to identify bearing faults [25] was published in 2016. In the following three years, papers using the same technology [26][27][28][29][30] promoted the development of various bearing fault detection methods. Additionally, other deep neural networks have been successfully applied in this field, such as the deep belief network (DBN) [31], recurrent neural network (RNN) [32], autoencoders [33], generative adversarial network (GAN) [34], and other related technologies. Among them, GAN was proposed in 2014 and quickly became one of the most exciting breakthroughs in the field of deep learning. In order to achieve a better balance between training speed and accuracy, adaptive CNN (ADCNN) was applied to the CWRU dataset in [26] to change the learning rate dynamically. Many variants of CNN have also been used for bearing fault diagnosis [27][28][29][30]. A semisupervised generative adversarial network (SSGAN) was used on the CWRU dataset to achieve a gratifying result [35].
Although the above studies have achieved encouraging results, some problems remain in the bearing fault detection field, such as the difficulty of data collection and the large amount of noise in the data. When sufficient and effective training samples cannot be obtained, it is difficult for ML and DL algorithms to achieve very high classification accuracy. This paper proposes a deep neural network based on the combination of GAN and CNN to solve the problem of bearing fault diagnosis with limited data. This model greatly improves the classification accuracy and robustness while reducing the dependence on the original dataset.
Our contributions mainly include the following: (1) This paper proposes a purely data-driven method based on WGAN to artificially synthesize new annotated fault type samples. More specifically, we use WGAN to estimate the distribution of observed fault samples and generate new samples that can be used for training deep networks. With this strategy, our training dataset is expanded and enhanced. A one-dimensional convolutional neural network (1D-CNN) is used to extract classification features efficiently and to train the model on original and generated samples. (2) We analyzed the difference between the original data and the generated data from qualitative and quantitative perspectives.
(3) In order to effectively improve the classification effect of the CNN model, we optimized the relevant quantities (convolution kernel settings, activation function, batch size, and learning rate). (4) In comparative experiments against other methods, the proposed model was verified on the CWRU bearing fault diagnosis benchmark dataset with an accuracy of 100%.
The rest of the paper is organized as follows. Section 2 introduces the basic CNN and WGAN models and the proposed WG-CNN model. Section 3 presents the experimental process and an analysis of the results. Section 4 summarizes this paper.

GAN and WGAN Models.
The generative adversarial network (GAN) consists mainly of two submodules: the generator model, denoted G, and the discriminator model, denoted D. GAN is based on the idea of competition. The purpose of G is to confuse D, and the purpose of D is to distinguish between the data generated by G and the data from the original dataset.
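In more detail, G: Z ⟶ X, where Z is the noise space, which can have any dimension and corresponds to the hyperparameter space, and X is the data space, whose distribution the generator aims to capture. The generator generates new data by fitting data features in the data space and randomly adding noise. The value function of G and D is

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))] \quad (1)$$

where $P_z$ is the noise distribution over Z. However, the above method has a series of problems such as unstable training and a nonconvergent generator loss function. WGAN [36] addressed the problem of GAN training instability. It uses the Wasserstein distance as shown in the following equation:

$$W(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|] \quad (2)$$

where $P_r$ and $P_\theta$ are the distributions of the original data and the generated data and $\Pi(P_r, P_\theta)$ represents the set of joint distributions whose marginals are $P_r$ and $P_\theta$.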
The loss function of WGAN is derived from this distance. Compared with GAN, WGAN largely solves the problem of unstable training, thus ensuring the diversity of generated samples. In this paper, we use the WGAN proposed in [36] as the data generation model and adopt the gradient penalty strategy of [37]. The overall structure of the generator and discriminator is shown in Figure 1. The generator and the discriminator have the same network structure, as shown in Figure 2.

CNN Model.
The convolutional neural network is a typical deep feed-forward neural network with a structure similar to that of a multilayer perceptron. Unlike multilayer perceptrons, CNN contains convolution kernels for extracting features. In a convolutional layer, there are usually several feature planes; each of them consists of a number of neurons in a rectangular arrangement, and each neuron is connected to only a part of the neurons in the adjacent layer. The advantages are reduced network complexity, improved computing efficiency, and an increased ability of the network to fit the data. A typical CNN architecture consists of the following basic elements.

Convolutional Layer.
The convolutional layer consists of several convolutional neurons. The weights and parameters of each neuron are learned by the backpropagation algorithm. A convolution filter within the convolutional layer provides a compressed representation of the input data and extracts features from it. Each filter consists of weights that are adjusted during the training phase of the network. Through the convolutional layer, the network can extract low-level edge features of the input data vector. In the calculation formula of the convolutional layer, $x_m^k$ and $x_m^{k-1}$ represent the outputs of the $m$th node in the $k$th layer and the $(k-1)$th layer, respectively, and $w_j^k$ and $b_j^k$ represent the weights and thresholds corresponding to the $m$th node in the $k$th layer, respectively.

Pooling Layer.
The pooling layer neurons perform pooling operations on the features obtained by the convolutional layer, segmenting them into feature regions and keeping a summary of each region, such as its highest value or its mean value, so that the resulting features are smaller. The transformed features are subsampled by a particular factor in the subsampling layer. The role of the subsampling layer is to calculate the values of a particular feature in an area of the input layer and merge them together, which reduces the variance of the transformed data. In the expression of the pooling layer, $h(x)$ represents the max-pooling function.

Fully Connected Layer.
The fully connected layer converts all local features into global features, thereby obtaining the output of the network. In the expression of the neuron, $y$ represents the output of the neuron, and $w^{k-1}$ and $x^{k-1}$ represent the connection weights and thresholds of the $k$th layer to the $(k-1)$th layer, respectively.

Activation Function.
After several convolutional and pooling layers, the classification function in the neural network is performed through fully connected layers. Each neuron of the fully connected layer is connected to all activations in the previous layer. The fully connected layer ultimately transforms the two-dimensional feature map into a one-dimensional feature vector. The derived vector can be used for binary or multiclass classification and for further processing. The structure of CNN is shown in Figure 3.
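The layer descriptions above can be illustrated with a minimal NumPy forward pass. This is only a sketch under stated assumptions: the kernel count (4), kernel size (5), pooling size (2), ReLU activation, and input length (64) are illustrative choices and are not the architecture or hyperparameters used in this paper, and all function names are hypothetical.

```python
import numpy as np

def conv1d(x, kernels, bias):
    """Valid 1D convolution (cross-correlation form).
    x: (length,), kernels: (n_kernels, k_size), bias: (n_kernels,)."""
    n_k, k = kernels.shape
    out_len = x.shape[0] - k + 1
    # Sliding windows of the input, shape (out_len, k_size)
    windows = np.stack([x[i:i + k] for i in range(out_len)])
    return windows @ kernels.T + bias  # shape (out_len, n_kernels)

def relu(a):
    return np.maximum(a, 0.0)

def max_pool1d(a, size):
    """Non-overlapping max pooling along axis 0; a trailing remainder is dropped."""
    n = (a.shape[0] // size) * size
    return a[:n].reshape(-1, size, a.shape[1]).max(axis=1)

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def forward(x, kernels, bias, w_fc, b_fc, pool=2):
    """Convolution -> activation -> pooling -> fully connected -> class scores."""
    feat = max_pool1d(relu(conv1d(x, kernels, bias)), pool)
    return softmax(feat.reshape(-1) @ w_fc + b_fc)

# Illustrative shapes only (assumed, not taken from the paper)
rng = np.random.default_rng(0)
x = rng.standard_normal(64)                      # one 1D vibration segment
kernels = rng.standard_normal((4, 5))            # 4 kernels of size 5
bias = np.zeros(4)
n_feat = ((64 - 5 + 1) // 2) * 4                 # flattened feature length
w_fc = rng.standard_normal((n_feat, 10)) * 0.1   # 10 output classes
b_fc = np.zeros(10)
probs = forward(x, kernels, bias, w_fc, b_fc)
```

The output `probs` is a vector of ten class scores that sums to one, which mirrors how the fully connected layer turns the pooled feature maps into a single vector used for classification.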

The Proposed Model.
In order to solve the problem of limited training data, this paper proposes a classification method combining Wasserstein GAN and CNN. Firstly, the original samples are divided into training set samples and test set samples, and the generative network is trained and used to augment the fault training samples, producing a large number of simulated samples. Then, the mixture of generated samples and original samples is used to train the deep learning classifier based on CNN. Finally, the trained classifier is tested using the test samples to verify the effectiveness of the method for the limited data problem.

Input of WG-CNN.
The input data are derived from the drive end bearing data. We directly use WGAN to extend the original dataset. The expanded data serve as the input of the CNN classifier, and we use the CNN classifier for direct feature extraction.
This method is an end-to-end learning process.

Output of WG-CNN.
The output of WGAN is a generated dataset that is consistent with the shape and size of the original input dataset. The original data and the generated data are fused, and the fused dataset is input into the CNN classifier. Features of the composite dataset are extracted by the convolution kernels, and the feature dimension is reduced by the pooling layer. After the fully connected layer, the classification is performed using the activation function (8). Finally, the classification result is determined by the output score of each class.

WG-CNN Loss Functions.
The Wasserstein distance is given in equation (2). The loss functions of the generator and discriminator are given in equations (9) and (10), respectively.
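For orientation, the generator and discriminator (critic) losses of a WGAN are commonly written in the following form. This is a sketch of the standard formulation from the WGAN literature, written in this paper's notation; it is an assumption that equations (9) and (10) take exactly this form and sign convention.

```latex
\begin{aligned}
L_G &= -\,\mathbb{E}_{x \sim P_\theta}\big[D(x)\big], \\
L_D &= \mathbb{E}_{x \sim P_\theta}\big[D(x)\big] - \mathbb{E}_{x \sim P_r}\big[D(x)\big].
\end{aligned}
```

Under this convention, the discriminator is trained to assign high scores to original samples and low scores to generated ones, while the generator is trained to raise the scores of its own samples.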

Mathematical Problems in Engineering
where $P_r$ and $P_\theta$ are the distributions of the original data and the generated data and $D(x)$ represents the output of the discriminator. Due to the interaction between the weight constraint and the cost function, the WGAN optimization process is difficult, which can cause the gradient to vanish or explode. The main reason is that WGAN clips the discriminator weights. Clipping drives the discriminator toward a binary network, which reduces model capacity. Following [37], a gradient penalty is used instead of clipping, so the loss function of the discriminator is expressed as follows:

$$L = \underbrace{\mathbb{E}_{x \sim P_\theta}[D(x)] - \mathbb{E}_{x \sim P_r}[D(x)]}_{\text{Original critic loss}} + \underbrace{\lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\big[(\|\nabla_{\hat{x}} D(\hat{x})\|_p - 1)^2\big]}_{\text{Gradient penalty}} \quad (11)$$

where $(\|\nabla_{\hat{x}} D(\hat{x})\|_p - 1)^2$ is the gradient penalty, evaluated at points $\hat{x}$ sampled between the original and generated data as in [37], and $\lambda$ is the gradient penalty weight parameter.

This problem is a typical multiclass classification problem, so the loss function of the CNN classifier is the cross entropy loss. For a sample point $(x, y)$, $y$ is the real label. In the multiclass problem, its value can only be taken from the label set. We assume that there are $K$ label values and $N$ samples in total, and that the probability that the $i$th sample is predicted as the $k$th label value is $P_{i,k}$, that is, $P_{i,k} = \Pr(t_{i,k} = 1)$. The loss function of the model is expressed as follows:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} t_{i,k} \log P_{i,k} \quad (12)$$
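The multiclass cross entropy loss described above can be computed in a few lines. The following is a minimal sketch, assuming integer class labels and a matrix of predicted class probabilities; the function name `cross_entropy` and the clipping constant `eps` are illustrative and not taken from the paper.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Mean multiclass cross entropy.
    probs:  (N, K) predicted probabilities, each row summing to 1.
    labels: (N,) integer class indices in [0, K)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    n = probs.shape[0]
    # With one-hot targets, only the probability of the true class contributes
    picked = probs[np.arange(n), labels]
    # Clip to avoid log(0) for confidently wrong predictions
    return float(-np.mean(np.log(np.clip(picked, eps, 1.0))))

# Two samples, two classes
loss = cross_entropy([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

Because the targets are one-hot, the double sum over samples and classes reduces to the log-probability of the true class for each sample, which is what the indexing step does.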

WG-CNN Optimizers.
The RMSProp (root mean square propagation) algorithm is used as the optimizer of the WGAN model. To damp excessive oscillation of the loss function during updates and to accelerate convergence, the RMSProp algorithm uses a weighted average of the squared gradients of the weight $W$ and the offset $b$. In the $t$th iteration, the update is given by equations (13) to (16).
where $\alpha$ is the learning rate of the network, $S_{dw}$ and $S_{db}$ are the gradient momentums accumulated by the loss function over the previous $t-1$ iterations, and $\beta$ is the coefficient of the gradient accumulation. Unlike other optimizers, the RMSProp algorithm calculates the weighted average of the squared gradients. This helps to suppress the direction with a large swing amplitude and to correct the swing amplitude, so that the swing amplitude in each dimension is small and the network converges faster. The stochastic gradient descent (SGD) algorithm is used as the optimizer of the CNN model; SGD updates the network weights by combining the gradient with the update amount of the previous iteration. The entire process is represented by equations (17) and (18).
where $W_{t+1}$ represents the weight of the network after the $(t+1)$th iteration and $V_{t+1}$ is the update amount of the network weight in the $(t+1)$th iteration.
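The two optimizers described above can be sketched as single update steps. This is a minimal illustration of the textbook RMSProp and SGD-with-momentum rules, which are assumed to correspond to equations (13) to (18); the small constant `eps`, the momentum coefficient, the learning rates, and the toy quadratic objective are illustrative assumptions, not values from the paper.

```python
import numpy as np

def rmsprop_step(w, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update: s is the running average of squared gradients."""
    s = beta * s + (1.0 - beta) * grad ** 2
    w = w - lr * grad / (np.sqrt(s) + eps)  # eps avoids division by zero
    return w, s

def sgd_momentum_step(w, grad, v, lr=0.01, momentum=0.9):
    """One SGD update that combines the gradient with the previous update v."""
    v = momentum * v - lr * grad
    return w + v, v

def grad_f(w):
    """Gradient of the toy objective f(w) = ||w||^2."""
    return 2.0 * w

w_r, s = np.array([1.0, -2.0]), np.zeros(2)
w_m, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(500):
    w_r, s = rmsprop_step(w_r, grad_f(w_r), s)
    w_m, v = sgd_momentum_step(w_m, grad_f(w_m), v)
```

Both runs move the weights toward the minimum at zero. RMSProp normalizes each coordinate by the root of its own squared-gradient average, which is why the step size is similar across dimensions regardless of gradient scale.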

Training and Testing.
The WG-CNN framework for bearing fault diagnosis, which combines WGAN and CNN, is shown in Figure 4. The part marked by the green line represents the training process, and the part marked by the blue line represents the testing process. The input is a three-dimensional mixing matrix consisting of a series of two-dimensional data. The outputs are the sample classes calculated by softmax. The generator loss function of the WGAN model is given by equation (9), the loss function of the discriminator is given by (10), and the loss of the CNN model is given by (12). The WGAN model is iteratively optimized by (13) to (16), and the CNN model is optimized by (17) and (18). We test the model with different numbers of samples in the dataset, which gives a more comprehensive evaluation of its performance; the input is the original dataset or the enhanced dataset produced by our model, and the output of the model is the predicted category. The classification effect of the model is assessed through evaluation indicators such as F1-score, precision, and recall.

Experiments and Results
The CWRU dataset is a basic dataset for verifying the performance of different machine learning (ML) and deep learning (DL) algorithms. In order to verify the performance of our method in limited-data fault diagnosis, we selected the drive end bearing health and fault data with a motor speed of 1730 rpm and a sampling frequency of 12 kHz from the Case Western Reserve University (CWRU) bearing dataset as the original experiment data.
There are three types of bearing fault locations: ball faults, internal faults, and external faults. Each fault location has three fault sizes: 0.007 inches, 0.014 inches, and 0.021 inches. We have a total of 10 types (0-9, 0 for health and 1-9 for different types and sizes of faults). The specific classification is shown in Table 1.

Sample Size and Generation Effect.
In the first part of the experiments, we evaluated the effect of generating and expanding data with WGAN and addressed the first two of the three challenges in limited-data fault diagnosis: (1) bearing faults have serious consequences, especially in actual production, so the industrial system is not allowed to enter the fault state; (2) most motor faults develop very slowly and follow a degradation path, so system degradation may take months or even years.
In order to maximize the data generation effect, we conducted a comparative test on the value of the gradient penalty coefficient λ. The experimental results are shown in Table 2. It can be seen from Table 2 that the accuracy is highest when the gradient penalty coefficient is 10. Therefore, in this paper, we set the gradient penalty coefficient to 10 for subsequent experiments.
Firstly, in order to balance the sampling, we selected the same proportion of data from each kind of fault dataset for the experiment. The dataset partition and the amount of each dataset are shown in Table 3. A is the original dataset, B is the input dataset of WGAN (3% chosen at random; the reason is given later), and C is the test dataset. Dataset D is the data generated by WGAN, which covers all ten types. The last dataset, E, is the enhanced dataset that combines the original dataset B and the generated dataset D.
Secondly, 14 groups of samples with sizes of 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1200, 2000, 3000, and 4000 were selected to train the WGAN model. Figures 5 and 6 show the changes of the loss function values of the generator and discriminator with a data amount of 4000. Over 100,000 iterations, the WGAN's generator loss value fluctuates from −1.573 to −0.009, and the WGAN's discriminator loss value fluctuates from −0.022 to +2.793. The loss values fluctuate greatly in the early stage and are stable in the middle and late stages. The values steadily approach zero, which shows that WGAN significantly improves on the convergence of the traditional GAN.
At the same time, in order to quantitatively analyze the generation effects of WGAN and GAN, this paper chooses the Fréchet distance (F) [38] as a metric and quantifies the generation effect by calculating the similarity between the original data and the generated data. The similarity is defined as S = 1/F. The results of the comparative experiment are shown in Table 4. Table 4 shows that the data generated by WGAN are more similar to the original data, which supports the choice of WGAN in this paper.
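As an illustration of the similarity measure S = 1/F, the sketch below computes a discrete Fréchet distance between two sampled curves by dynamic programming. It is an assumption that F denotes this discrete curve-to-curve variant; the source does not state which form of the Fréchet distance from [38] is used, and the function names are hypothetical.

```python
import numpy as np

def discrete_frechet(p, q):
    """Discrete Fréchet distance between two point sequences.
    p: (n,) or (n, d); q: (m,) or (m, d)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.ndim == 1:
        p = p[:, None]
    if q.ndim == 1:
        q = q[:, None]
    n, m = len(p), len(q)
    # Pairwise Euclidean distances between all points of p and q
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=2)
    ca = np.empty((n, m))
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            best_prev = min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1])
            ca[i, j] = max(best_prev, d[i, j])
    return float(ca[-1, -1])

def similarity(p, q):
    """S = 1/F; identical curves give infinite similarity."""
    f = discrete_frechet(p, q)
    return float("inf") if f == 0.0 else 1.0 / f

# Two parallel line segments one unit apart
real = [[0, 0], [1, 0], [2, 0]]
synthetic = [[0, 1], [1, 1], [2, 1]]
s_val = similarity(real, synthetic)
```

A smaller distance between a real and a generated signal gives a larger S, so a higher value in a table of similarities indicates generated data that follow the original curve more closely.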
Thirdly, we use dataset B as the training set and dataset C as the test set of the proposed model. In order to design the optimal CNN classifier structure, we conducted a comparative experiment on the number of convolution kernels and the type of activation function. The results of the experiment are shown in Table 5. At the same time, the effect of batch size and learning rate on the model test accuracy was analyzed experimentally. The results are shown in Figure 7. A total of 14 groups, with sizes of 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1200, 2000, 3000, and 4000, respectively, were trained and tested (Table 6). Figure 8 shows part of the confusion matrices of the test results. Among them, health type 0 and fault types 4 and 6 are easy to identify, but fault types 3 and 9 are not. It should be noted that as the amount of training data increases, the test results become better. When the amount of data in the training set reaches 4000, the method shows a satisfactory classification result, and the test accuracy is more than 98%. Therefore, in the following experiments, we usually choose 3% (4000/121556 ≈ 0.03) as the minimum proportion selected from the original data. Experimental results show that the WGAN model can significantly reduce the amount of data needed for training while still meeting the accuracy requirements.

Figure 4: WG-CNN framework. The WG-CNN architecture is mainly composed of three parts: a generator (for generating data), a discriminator (to distinguish between the generated data from G and the data from the original dataset), and a classifier (to evaluate the effect of enhanced data).

Finally, we selected representative categories 2, 4, 7, and 8 to display the effect of sample data generation. As shown in Figure 9, "real" is the real sample and "synthetic" is the generated sample. The orange line in the middle represents the mean of the data, the green line above is the mean plus the variance, and the blue line below is the mean minus the variance. Since the generated samples are standardized, their ordinate values differ slightly from those of the actual data.
Among the four types of faults shown, categories 7 and 8 produced the best results, and the trend of the generated samples was basically the same as that of the real samples. The variance of categories 7 and 8 is significantly larger than that of categories 2 and 4. This shows that the generated data are inevitably affected by the noise in real environments and that the applicability of the generated data needs to be further improved. In the following section, we explore the relationship between the amount of training data and the accuracy of the CNN-based deep learning classifier.

Enhancement Data and Accuracy.
When selecting training samples in the CWRU dataset, many previous papers cannot guarantee balanced sampling, which means that the proportion of data samples selected from the normal state and the fault state is not close to 1 : 1. If most training sets are from health data, the learned features will not be suitable for fault classification, so this paper proposes an average sampling method in each type to deal with data imbalance.
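The per-type average sampling described above can be sketched as stratified random selection. This is a minimal illustration, assuming integer class labels and a fixed fraction per class; the function name `balanced_sample_indices` and the toy label array are hypothetical and do not reproduce the paper's data loading.

```python
import numpy as np

def balanced_sample_indices(labels, fraction, rng=None):
    """Pick the same fraction of samples from every class, without replacement."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    chosen = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)           # all samples of this class
        n_take = max(1, int(round(fraction * idx.size)))
        chosen.append(rng.choice(idx, size=n_take, replace=False))
    return np.sort(np.concatenate(chosen))

# Toy example: 10 classes with 200 samples each, keep 3% per class
toy_labels = np.repeat(np.arange(10), 200)
subset = balanced_sample_indices(toy_labels, 0.03, np.random.default_rng(0))
per_class = np.bincount(toy_labels[subset], minlength=10)
```

Sampling class by class keeps the class proportions of the subset equal by construction, whereas drawing 3% from the pooled data could leave some fault types under-represented by chance.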
On this basis, in order to describe the effect of different training sets more accurately, the average accuracy should not be used as the only index to evaluate the algorithm. Other indexes should be used to measure the effectiveness and reliability of the algorithm, such as precision, recall rate, specificity, and F1-score.
Precision is the ratio of correctly classified positive samples to the number of all samples classified as positive. Recall (in binary classification also known as sensitivity) is the ratio of correctly classified positive samples to the number of all actual positive samples.

In order to compare the differences between the generated dataset and the original dataset, we conducted the following experiments with the control variable method. In the first part, we concluded through experiments that when the amount of data in the set reaches 4000, the WGAN model already performs well, and this training sample is only 3% of the original data. Therefore, in the experiments of this section, we randomly choose 3% of the original data as dataset B to train WGAN to generate a new dataset.
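The evaluation indexes named above (precision, recall, specificity, and F1-score) can be computed per class from a confusion matrix. The following is a minimal sketch using the standard one-vs-rest definitions; the function name and the toy labels are illustrative assumptions and are not results from the paper.

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes):
    """Return per-class precision, recall, specificity, and F1 (one-vs-rest)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                      # rows: true class, columns: predicted
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp               # predicted as the class but wrong
    fn = cm.sum(axis=1) - tp               # belonging to the class but missed
    tn = cm.sum() - tp - fp - fn

    def safe_div(a, b):
        # Return 0 where the denominator is 0 instead of dividing by zero
        return np.divide(a, b, out=np.zeros_like(a, dtype=float), where=b > 0)

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    f1 = safe_div(2.0 * precision * recall, precision + recall)
    return precision, recall, specificity, f1

# Toy labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
precision, recall, specificity, f1 = per_class_metrics(y_true, y_pred, 3)
```

Reporting these values per class, rather than a single average accuracy, makes it visible when one fault type is systematically confused with another.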
Firstly, CNN was trained on dataset B, and dataset C was used as the test set to verify the fault classification effect of the model. The results are shown in Table 7 and Figure 10(a).
Secondly, we trained on dataset D and tested on dataset C to verify the fault classification effect of the generated dataset. The results are shown in Table 8 and Figure 10(b). As can be seen from the comparison in Figures 10(a) and 10(b), the precision, recall rate, and F1-score on the generated data are very similar to those on the original dataset, which shows that CNN still performs well on the generated data and that the generated data are usable.
Thirdly, the experiment analyzes the classification effect of the enhanced data. We input the original data into WGAN, train the model through the confrontation between the generator and the discriminator, and generate the high-quality dataset D. Then, the original dataset B and the generated dataset D are combined to get the enhanced dataset E; we train on dataset E and test on dataset C. The test results are shown in Table 9 and Figure 10(c).
It can be seen from Figures 10(a)-10(c) that, compared with the original dataset B and the generated dataset D, the CNN classification effect under the enhanced dataset E is greatly improved, and the F1-score of four items reaches 100%. Figure 10(d) shows the average value of each indicator, computed with the sklearn evaluation functions, for the three comparative experiments. The accuracy of CNN classification based on the enhanced dataset E has obvious advantages. The experiment shows that WGAN is an effective data enhancement strategy for bearing fault classification.
To compare the performance of different enhancement algorithms, for 5 randomly chosen fault types we randomly select 20% of the samples from the original dataset as the training set and 40% of the samples as the test set. For the other 5 types of faults, 10% of the samples are randomly selected as the training set and 40% of the samples are used as the test set. For the rare data, all comparison methods except SVM adopt an enhancement strategy so that the training set for each type of fault amounts to 20% of the original data. SVM [39,40], CNN with oversampling, CNN with downsampling, and GAN-CNN are compared with the proposed model (WG-CNN).
The experimental results are shown in Table 10.
It can be seen from Table 10 that the benchmark model (SVM) has a poor classification effect on unbalanced datasets, with an average accuracy of only 75%. Two general data enhancement methods are used to enhance the CNN classifier, respectively. The accuracy of the classifier enhanced by the downsampling method is higher than that of the classifier enhanced by oversampling. We also compare data generated by the original GAN with the proposed model. The results show that the model proposed in this paper achieves a higher accuracy; it learns the distribution and characteristics of the data more accurately and provides high-quality generated data for the classification task.

Comparison of Different Algorithms.
In this section, we test different algorithms to compare their performance. For the 10 different types of data, 20% of the original data were randomly selected as the input of the different algorithms. At the same time, 40% of the original data were randomly selected as the test dataset. In order to select the number of training iterations in the experiment, we define the algorithm efficiency (AE). Through 5 sets of comparative experiments, the efficiency of the algorithm under different numbers of iterations is obtained, as shown in Table 11.
Through the comparison experiment of algorithm efficiency, we found that when the number of iterations is 10000, the algorithm efficiency is as high as 1.519, but the test accuracy at this point is only 0.41, so 10000 cannot be used as the number of iterations in this paper. Combining algorithm efficiency and test accuracy, we found that when the number of iterations is 100000, both the efficiency and the accuracy of the algorithm reach the desired level, so 100000 is used as the number of iterations in the subsequent comparative experiments. This also confirms the choice of the number of iterations in our previous experiments. Then, the experimental analysis for each comparative algorithm was performed. The references indicate the paper in which each model was proposed, and we use each proposed model to conduct experimental analysis on the same dataset. Table 12 shows the classification accuracy of different DL algorithms on the CWRU datasets.
It can be easily observed that the test accuracy of all CNN-based deep learning algorithms exceeds 70%, which supports the feasibility and effectiveness of using CNN for bearing fault diagnosis. However, under the same training data and test data settings, the accuracy differs considerably between models. Among them, the original CNN model has the lowest accuracy for fault classification, reaching only 72.4%. The model with the highest classification accuracy on the dataset is the model proposed in this paper (WG-CNN), which reached 100%. This demonstrates the effectiveness and practicability of the proposed algorithm.
As can be seen from the data in Table 12, the proposed WG-CNN learning algorithm can take into account the accuracy of classification while significantly reducing its dependence on the original data.

Noise and Classification Results.
The fourth part of the experiment tests the stability and robustness of the WG-CNN algorithm. In fact, all bearing defects in the CWRU dataset were drilled or engraved manually, which makes them easier to identify than actual bearing wear and general roughness due to aging. It can be seen from Table 12 that even the classic, ordinary CNN achieves a classification accuracy above 70% on the CWRU dataset. This shows that the dataset contains relatively simple features, which can be easily extracted by a variety of DL methods. Therefore, in the experiment, the original data provided by CWRU were used to train the model, and white Gaussian noise with different SNRs was then added to the test samples for testing. The SNR varies from −4 dB to 8 dB. Dataset B is used as the training set of the CNN model and the proposed model, and different test sets are formed by adding different proportions of noise to dataset C.
In this experiment, we evaluate the classification effect of the WG-CNN algorithm proposed in this paper to determine whether it can address the third challenge in limited-data fault diagnosis: because of the complex working conditions of mechanical bearings, it is difficult to collect and label enough training samples with reasonable noise. In the following formula and Table 13, the signal-to-noise ratio (SNR) is defined as the ratio of signal power to noise power, usually expressed in decibels:

$$\mathrm{SNR_{dB}} = 10 \log_{10}\left(\frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}}\right)$$

where $P_{\mathrm{signal}}$ and $P_{\mathrm{noise}}$ are the power of the signal and the power of the noise, respectively. Figure 11 shows the comparison of the test results of the WG-CNN model in different noise environments. At 4 dB and 6 dB SNR, the difference between the training set and the test set is small, and the test results are better. When the SNR is −4 dB or −2 dB, the difference between the training set and the test set is large, and the accuracy is lower. WG-CNN is more robust and stable than traditional CNN.
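The noisy test sets described above can be produced by scaling white Gaussian noise to a target SNR using the definition just given. The following is a minimal sketch, assuming the noise power is set relative to the measured power of each clean signal; the function names and the synthetic sine wave are illustrative and are not the paper's data or code.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise so that the result has the requested SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    p_signal = np.mean(signal ** 2)                     # average signal power
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))      # invert the SNR definition
    noise = rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)
    return signal + noise

def measured_snr_db(clean, noisy):
    """Empirical SNR in dB of a noisy signal relative to its clean version."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noisy, dtype=float) - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))

# Synthetic stand-in for a vibration segment
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20000)
clean = np.sin(2.0 * np.pi * 50.0 * t)
noisy_sets = {snr: add_awgn(clean, snr, rng) for snr in (-4, -2, 0, 2, 4, 6, 8)}
```

At negative SNR values the noise power exceeds the signal power, which is consistent with the larger gap between training and test data that the text reports for −4 dB and −2 dB.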
Comparing Figures 11(a) and 11(b), we find that the average accuracy of WG-CNN is 40% at −4 dB and that of CNN is 32% at −4 dB. At high SNR (4 dB, 6 dB, and 8 dB), the average accuracy of WG-CNN is slightly higher than that of CNN, and the curve coincidence is significantly higher than that of CNN. At this time, the noise power is less than the signal power, which is closer to the actual working environment. Experiments show that WG-CNN has stronger robustness and stability than traditional CNN in noisy environments.

Conclusions
This paper proposes a few-shot learning neural network method for bearing fault diagnosis with limited data. This method is based on the WG-CNN model, which combines a Wasserstein generative adversarial network (WGAN) and a convolutional neural network (CNN). It can be used for deep learning on limited samples, and it can effectively improve the performance of fault diagnosis. We addressed three major challenges in the field of bearing fault diagnosis. Experiments show that WG-CNN can significantly reduce the number of training samples required. When only 20% of the standard Case Western Reserve University (CWRU) bearing fault diagnosis reference dataset is selected, the classification accuracy reaches 100%, which is the highest among all the compared papers. In the future, we plan to combine WG-CNN with transfer learning and domain adaptation and use this method to diagnose bearing faults in the field.

Data Availability
The simulation data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.