Oversampling Imbalanced Data Based on Convergent WGAN for Network Threat Detection



Introduction
In recent years, network attacks and threats have gradually become two of the most important issues in cyberspace. For example, in blockchain, big data, and IoT systems, network attacks occur frequently, and global privacy leakage incidents caused by network threats are rising rapidly, causing great economic losses and undermining social stability [1][2][3]. Network threat detection based on machine learning and deep learning, which relies on massive data, is one of the important ways to protect network security. However, the data sets collected through various channels are usually imbalanced: the number of samples in each category varies significantly. In most cases, the benign category has a large number of samples, the so-called majority class, while the malicious category has much fewer samples, the so-called minority class. For example, in network intrusion detection, most traffic is normal with very few abnormal flows [4]. In Android malware detection, the proportion of malware apps is relatively low [5]. In network threat detection, if malicious samples of the minority class are misjudged and spread widely, users will suffer great losses. Therefore, the correct and accurate detection of malicious data in imbalanced data sets is very important. Imbalanced data sets are so common that they bring many troubles to data mining and analysis [6]. Moreover, when the number of minority-class samples is very small, these samples are likely to be discarded as noise, which harms classifier training and reduces the classification accuracy of the minority class [7,8]. Therefore, strengthening the classification ability and improving the classification accuracy for the minority class is an urgent problem.
So far, numerous methods and strategies have been proposed to solve the problem of imbalanced data, which fall into two major categories: data-level solutions and algorithmic-level solutions [9][10][11]. Data-level approaches focus on the data sets themselves, undersampling the majority class, oversampling the minority class, or combining both to balance the data. Undersampling removes regular data from the majority class, which risks losing important information. Oversampling generates new data close to the minority class, which is popular because it uses all the available information and can significantly improve the overall classification performance. The synthetic minority oversampling technique (SMOTE) [12] is the most representative and popular oversampling algorithm; it generates synthetic minority examples by linear interpolation between each minority example and its nearest neighbors. Algorithmic-level approaches modify the algorithms to focus on the classification accuracy of the minority class and can be divided into four types: cost-sensitive methods, kernel-based learning methods, active learning methods, and ensemble methods [10,13]. There are a variety of common algorithmic-level methods, such as a Bayesian optimization algorithm that maximizes Matthew's correlation coefficient by learning optimal weights for the positive and negative classes [14].
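SMOTE's interpolation step, as described above, can be sketched as follows. This is a minimal illustration, not a reference implementation: the nearest-neighbor search is omitted, and the function name is hypothetical.

```python
import numpy as np

def smote_interpolate(x, neighbor, rng):
    """Synthesize one sample on the line segment between a minority
    example and one of its minority-class nearest neighbors."""
    gap = rng.random()                  # uniform in [0, 1)
    return x + gap * (neighbor - x)     # linear interpolation

# Example: a synthetic point between two minority samples
rng = np.random.default_rng(42)
synthetic = smote_interpolate(np.array([1.0, 2.0]), np.array([3.0, 4.0]), rng)
```

Because the interpolation stays on the segment between minority examples, every synthetic point lies inside the minority region, which is exactly why SMOTE struggles when that region becomes sparse in high-dimensional space.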
However, when the amount of data is large and the dimension is high, the traditional methods mentioned above are not so effective. Machine learning and deep learning techniques make up for these shortcomings [15]. The generative adversarial network (GAN) is a typical generative model, which is regarded as a potential solution for imbalanced data because it can generate new samples for the minority class [16,17]. A GAN is composed of two neural networks: a generator and a discriminator. The generator generates complex, high-dimensional data as close to the real data as possible to deceive the discriminator. The goal of the discriminator is to distinguish the real data from the false data generated by the generator as accurately as possible. Through repeated adversarial training, a generator with excellent performance is obtained [18]. However, GAN has some problems, such as unstable training, gradient disappearance, and mode collapse [19,20]. To solve these problems, many variant algorithms have been proposed, and Wasserstein GAN (WGAN) is one of the most typical [21]. WGAN uses the Wasserstein distance instead of the Jensen-Shannon (JS) divergence to make training stable. Most importantly, WGAN not only solves the unstable training problem of GAN without requiring a carefully balanced schedule between the generator and discriminator but also alleviates mode collapse and ensures the diversity of the generated samples. Another advantage of WGAN is that it can deal with discrete data. Therefore, we apply WGAN as an oversampling method to generate new minority samples and solve the imbalance problem.
However, like GAN, WGAN is still not perfect and shows defects such as training difficulty and slow convergence in real experiments. Therefore, we construct a condition for the alternate training between the generator and discriminator. Inspired by the idea of letting the convergence of the discriminator guide the training stability of WGAN, we propose a convergent WGAN-based oversampling model called CWGAN, which contains a generator and a convergent discriminator. The generator is used to generate new samples close to the real ones, and the discriminator is used to distinguish the real minority samples from the new samples generated by the generator. In the training process of CWGAN, the generator and the discriminator are trained alternately over many iterations. In each iteration, the number of discriminator epochs is dynamic and is determined by the convergence of the loss function in the last two iterations. When the discriminator is trained to convergence, the generator is trained. Through this process, we obtain the number of discriminator epochs in each iteration.
The proposed method has the following advantages: (1) Compared with WGAN, the proposed CWGAN constructs a condition for the alternate training between the generator and discriminator to guide training stability. (2) During the training process of CWGAN, the number of discriminator epochs is dynamic and is determined by the convergence of the loss function in the last two iterations. (3) The CWGAN model can solve the imbalanced data problem by generating new samples for the minority class to balance the data set. (4) CWGAN not only improves the accuracy of the minority class but also enhances the accuracy of the majority class and of the whole data set. The remainder of this paper is organized as follows. Section 2 discusses related work on GANs for imbalanced data and on imbalanced GAN training. A description of the CWGAN model is provided in Section 3. Experimental results and analysis are presented in Section 4. The work is concluded in Section 5.

GAN and Imbalanced Data.
Class imbalance is a common issue in classification tasks in cyberspace. Generating minority samples, classified as oversampling, is regarded as an effective way to solve the class imbalance problem. Random oversampling, the synthetic minority oversampling technique (SMOTE) [12], and borderline SMOTE [16] are considered the best traditional oversampling algorithms. However, when the data occupy a high-dimensional space, their performance degrades significantly. By comparison, GANs can generate artificial data close to the minority class even when the data are high dimensional and the data distributions are complex. So far, researchers have produced many modified versions of GANs.
Hao et al. [22] proposed an Annealing Genetic GAN (AGGAN) method to solve the class imbalance problem by generating data close to the minority class distribution based on limited data samples. In the training process, AGGAN used the mechanism of simulated annealing (SA) to update the training of GANs and to avoid local optima in order to obtain the best generator, which could generate data close to the minority classes. Both theoretical analysis and experimental studies showed that AGGAN balanced the classes efficiently and effectively by reproducing their distributions.
Based on a conditional Wasserstein GAN, Engelmann and Lessmann [9] proposed an oversampling method called the cWGAN-based model to balance tabular data with numerical and categorical variables. In the cWGAN-based structure, the conditional distribution is estimated to sample the minority class explicitly. The loss function of the cWGAN-based model adds an auxiliary classifier (AC) loss to encourage the generator to generate samples. The experiments showed that the cWGAN-based architecture could successfully estimate the data distribution and outperformed other oversampling methods.
Deepshikha and Naman [14] proposed a new GAN framework consisting of a generator network G, a discriminator network D, and a classifier network C, where the generator worked with the classifier to generate samples in a min-max game with the discriminator to balance the classes. In the framework, the generator could generate samples in the convex hull by competing with the classifier. Consequently, the generated samples appeared on the data boundary with a better distribution, which made it easy for the classifier to find the decision boundary of the whole data set combining the original imbalanced data and the generated data. The experiments showed that the proposed method with an additional classifier achieved better performance than other state-of-the-art techniques on image data sets.
Kim et al. [23] proposed a novel GAN-based model consisting of an autoencoder as the generator and two separate discriminators for anomaly detection on image data sets. Meanwhile, they proposed new loss functions consisting of a patch loss and an anomaly adversarial loss to optimize the GAN-based model and improve the robustness and performance of defect detection. There were eight loss functions in total: six for the generator, one for the normal discriminator, and another for the anomaly discriminator. The experiments were carried out on benchmark data sets and real-world data sets, and the experimental results on both were state of the art.
In summary, since GAN was proposed by Goodfellow, it has been widely used to generate new class samples to solve the imbalance problem, and the experimental results show that oversampling based on GAN can significantly improve the classification performance of the minority class. However, most researchers focus on solving the imbalance problem on image data sets, and few on numerical data. Actually, image data are very different from numerical data. Therefore, when solving the imbalance problem of network security data, it is necessary to adapt GAN-based methods to improve the detection performance of network threats.

GAN Training Imbalance.
GAN is one of the hottest deep learning models, but it is hard to train, and the Nash equilibrium between the discriminator and generator is almost impossible to achieve during training. Training instability is regarded as one of its main defects [19,20]. Many researchers have put forward their own ideas to solve the problem and have designed improved models suitable for specific tasks. New research articles about modified versions of GAN are published every week, with names such as 3D-GAN, BEGAN, and iGAN. Avinash Hindupur maintains a website called "The GAN Zoo" [24] on GitHub that lists the new GAN models.
Many variants of GAN have been proposed to improve training stability and performance. According to their composition and loss, GAN models can be divided into two categories: architecture variants and loss variants. The former benefit specific applications, and the latter improve performance and enable more stable training [19].
The schemes belonging to loss variants are divided into three types: restricting the loss function of the generator [25], restricting the loss function of the discriminator [26,27], and restricting the loss functions of both simultaneously [28,29]. Our work focuses on training stability, approximately achieving the Nash equilibrium by restricting the loss function of the discriminator to improve performance.
Heusel et al. [30] proposed a two time-scale update rule (TTUR) for training GAN models with stochastic gradient descent on the loss functions. The TTUR defines individual learning rates for the discriminator and the generator, and the authors analyzed the learning rates in depth, showing that the TTUR converges to a local Nash equilibrium. Experiments showed that the TTUR improved the performance of the generator on image data sets.
Brock et al. [31] studied the instabilities of large-scale GANs. On the one hand, two simple, general architectures with two to four times as many parameters and eight times the batch size were introduced to improve scalability. On the other hand, the generator used orthogonal initialization to make it suitable for a simple "truncation trick," which trades off sample variety against fidelity by reducing the variance of the generator inputs. The experiments produced natural images of multiple categories and improved conditioning, demonstrably boosting the performance of GANs.
Mescheder et al. [32] pointed out that unregularized GAN training is not always locally convergent and that the Nash equilibrium makes training stability hard to achieve. They analyzed the convergence properties and discussed some regularization strategies that make GAN training stable, including adding instance noise and gradient penalties. In the experimental part, several gradient penalties were used to prove local convergence of the regularized GAN training dynamics.
Zadorozhnyy et al. [33] proposed a new family of discriminator loss functions combining real and fake data. According to the gradients of the loss, weights are computed adaptively to train the discriminator and obtain stability. These methods can be used for any discriminator model whose loss is a sum over the real and fake data. The experimental results showed that the methods were especially effective on image generation tasks.
Lee and Seok [34] proposed that it is essential to maintain balanced training between the discriminator and generator. They argued that multiple discriminator updates and imbalanced learning rates could balance the training; both approaches train the discriminator more intensively than the generator. However, the learning rates are dynamic across different GAN architectures, data sets, and tasks, and further study is needed to select a suitable number of updates and suitable learning rate values.
Sidheekh et al. [35] demonstrated that using the duality gap to monitor the training progress of a GAN, as in the prevalent approach, is problematic, and proposed a dependable measure estimating the duality gap based on locally perturbed gradient descent, which overcomes the limitations of the duality gap. In the experiments, the researchers demonstrated that their proposed measure could inform the training process, such as tuning hyperparameters, across a wide variety of GAN models and data sets. The proposed method could also identify model convergence or divergence and measure the potential performance of GANs. The training of GAN is an important and difficult problem, and many schemes have been put forward. The idea of most algorithms is to balance the training of GAN by changing the losses of the generator and discriminator. Inspired by this, we judge the convergence of the discriminator by calculating the trend of the discriminator loss during training and use it as the basis for determining the number of epochs in subsequent training.

Proposed Solution
Our proposed network threat detection model based on CWGAN is shown in Figure 1. The architecture consists of three parts: data segmentation, deep learning for oversampling, and shallow learning for classification. The details and process of the architecture are as follows: (1) Data segmentation: at first, the original imbalanced data sets are divided into training data sets and testing data sets, respectively. Both the training data set and the testing data set are imbalanced, including the majority class and the minority class. Then, the minority class is input to the deep learning model for oversampling.
(2) Deep learning for oversampling: CWGAN works as a deep learning model containing a generator and a convergent discriminator. The generator and discriminator are composed of fully connected layers. The generator is used to generate new data close to the real data, and the discriminator is used to distinguish the real minority data from the false data generated by the generator as accurately as possible.
Overall, the training process of CWGAN contains multiple iterations. In each iteration, the number of training epochs of the discriminator is dynamic and is determined by the convergence of the loss function in the last two iterations. Each iteration is divided into two stages. The first stage fixes the discriminator to train the generator. At the beginning, the generator is relatively weak and the generated data are easy for the discriminator to identify; as training continues, the performance of the generator improves. The second stage fixes the generator to train the discriminator to distinguish the real data from the generated false data. The loss function of the discriminator is computed between the real data and the false data, and its convergence is proved through the convexity and Lipschitz continuity conditions. When the discriminator converges, the losses of the previous iteration and the current iteration are compared to determine the number of discriminator epochs in the next iteration. Therefore, the epoch count is used as the index that balances the training between two convergent iterations.
(3) Shallow learning for classification: after CWGAN generates the new minority samples for the minority class in the training data sets, the new minority samples and the original training data sets are fused to form the new balanced training data sets. Then, the new balanced training data sets are input to shallow learning models for training, and the imbalanced testing data sets are input to the trained shallow learning models to predict the labels. At the same time, the accuracy of the majority class in the testing data set (acc+), the accuracy of the minority class in the testing data set (acc−), and the accuracy, precision, recall, F1, and G-means of the whole testing data set are computed to evaluate the performance.

Data Segmentation and Data Imbalance.
The original imbalanced data sets are divided into training data sets and testing data sets, respectively. Both the training data set and the testing data set are imbalanced, including the majority class and the minority class. The data object we deal with is the training set.
Suppose the training set is X = {x_1, x_2, ..., x_N} and its size is N. The majority class is X_d = {x_{d1}, x_{d2}, ..., x_{dN+}} with size N+. The minority class is X_s = {x_{s1}, x_{s2}, ..., x_{sN−}} with size N−. The imbalance ratio (IR) of the training set is the size of the majority class divided by the size of the minority class, defined as follows:

IR = N+ / N−. (1)

IR indicates the degree of imbalance of the data set. The higher the IR, the more imbalanced the data set, and the more data need to be generated. In order to balance the class samples, it is necessary to generate (N+ − N−) new samples for the minority class. Since (N+ − N−) = (IR − 1)N−, the imbalance ratio also determines the number of generated samples.
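As a small worked example of formula (1) and the sample count derived from it (function names are illustrative):

```python
def imbalance_ratio(n_major, n_minor):
    """Formula (1): IR = N+ / N-."""
    return n_major / n_minor

def samples_to_generate(n_major, n_minor):
    """Number of new minority samples needed to balance the classes:
    (N+ - N-), which equals (IR - 1) * N-."""
    return n_major - n_minor

# A data set with 900 benign and 100 malicious samples has IR = 9,
# so 800 new minority samples must be generated.
ir = imbalance_ratio(900, 100)
needed = samples_to_generate(900, 100)
```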

Deep Learning Model Training.
The WGAN model is built with a generator and a discriminator. The generator is used to generate new samples for the minority class, and the discriminator is used to distinguish the real minority samples from the generated samples. Suppose the input data of the generator are the noise data X_Noise, the input data of the discriminator are the real minority data X_s, and the false data generated by the generator are X_f. The WGAN model uses fully connected layers to construct the generator and discriminator. The loss function is based on the Wasserstein distance, defined as

W(P_d, P_G) = inf_{γ∈Π(P_d, P_G)} E_{(X_s, X_f)∼γ} [‖X_s − X_f‖], (2)

where P_G is the distribution of the generator's output data, P_d is the distribution of the real data seen by the discriminator, and Π(P_d, P_G) is the set of all possible joint distributions combining P_G and P_d.
For a joint distribution γ, the samples X_s and X_f conform to (X_s, X_f) ∼ γ, and ‖X_s − X_f‖ represents the distance between the samples. The expected value of the sample distance under the joint distribution is E_{(X_s, X_f)∼γ}[‖X_s − X_f‖], whose infimum over all joint distributions is defined as the Wasserstein distance.
In order to solve the equation, the K-Lipschitz constraint is introduced, defined as

|f_w(x_1) − f_w(x_2)| ≤ K‖x_1 − x_2‖, (3)

where f_w(·) is the function of the neural network model and K is the Lipschitz constant. The approximate solution of formula (2) becomes

K · W(P_d, P_G) ≈ sup_{‖f_w‖_L ≤ K} ( E_{x∼P_d}[f_w(x)] − E_{x∼P_G}[f_w(x)] ). (4)

The parameter K makes the gradient K times larger but does not affect the direction of the gradient. Formula (4) can be expressed as a discriminator network with parameter w whose last layer is not a nonlinear activation layer. Then, formula (4) takes the upper bound over all functions f_w(·) satisfying the K-Lipschitz restriction, and the Wasserstein distance is converted to

L = E_{x∼P_d}[f_w(x)] − E_{x∼P_G}[f_w(x)]. (5)

Next, the generator approximately minimizes the Wasserstein distance, which is equivalent to minimizing L in formula (5). Considering that the first term of formula (5)

is independent of the generator, we can obtain the discriminator loss and the generator loss of WGAN in formulas (6) and (7):

L_D = E_{x∼P_G}[f_w(x)] − E_{x∼P_d}[f_w(x)], (6)

L_G = −E_{x∼P_G}[f_w(x)]. (7)
Formula (6) is the negative of formula (5), and formula (6) can indicate the training progress: the smaller the value of formula (6), the smaller the Wasserstein distance between the real data and the generated data, and the better the WGAN training.
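A minimal sketch of the losses in formulas (6) and (7), with a stand-in linear critic playing the role of f_w (in the paper, f_w is a fully connected network whose last layer has no nonlinear activation; all names here are illustrative):

```python
import numpy as np

def critic(x, w):
    return x @ w                          # f_w(x): a linear stand-in critic

def discriminator_loss(x_real, x_fake, w):
    # Formula (6): L_D = E[f_w(x_fake)] - E[f_w(x_real)],
    # the negative of the estimated Wasserstein distance in formula (5)
    return np.mean(critic(x_fake, w)) - np.mean(critic(x_real, w))

def generator_loss(x_fake, w):
    # Formula (7): L_G = -E[f_w(x_fake)], so minimizing L_G pushes
    # the critic's score on generated samples up
    return -np.mean(critic(x_fake, w))
```

Minimizing L_D trains the critic to score real samples higher than fakes; minimizing L_G trains the generator to close that gap, which shrinks the estimated Wasserstein distance.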

Model Training Convergence.
After the model is established, it enters the iterative training process. The whole process contains many iterations, and each iteration contains many epochs. In each iteration, the generator and the discriminator are trained alternately. Training is the process of reducing the discriminator loss and the generator loss. In our work, we design a convergence rule: when the generator is fixed, the number of training epochs of the discriminator is not fixed but is determined dynamically according to the convergence of the loss function.
Assuming that the distribution of the generated data is similar to that of the real data, within one iteration of the training process, the discriminator's recognition ability is continuously enhanced and eventually converges. The convergent discriminator can correctly classify the generated data and the real data.
When the discriminator converges, the number of epochs in the next iteration is determined by the change of the discriminator loss between the previous iteration and the current iteration. Therefore, we use this change of the discriminator loss as the index to balance the training of the discriminator in the next iteration.
Suppose the discriminator loss function at the t_s-th epoch of the s-th iteration is L_D^{(s,t_s)}(f_w(x)), which is convex and K-Lipschitz. The update of the discriminator loss function is

x^{(t_s+1)} = x^{(t_s)} − α^{(s,t_s)} ∇L_D^{(s,t_s)}(f_w(x^{(t_s)})), (8)

where x ∈ [X_f, X_s] and α^{(s,t_s)} is the stride. Suppose x* is the optimal solution of L_D^{(s,t_s)}(f_w(x)), which is reached at the t_s-th epoch, and L_D* is the optimal value of L_D^{(s,t_s)}(f_w(x)). K is the Lipschitz constant, and its value is 1. Therefore, we can get

L_D^{(s,t_s)}(f_w(x^{(t_s)})) − L_D* ≤ ∇L_D^{(s,t_s)}(f_w(x^{(t_s)}))^⊤ (x^{(t_s)} − x*) ≤ (‖x^{(t_s)} − x*‖² − ‖x^{(t_s+1)} − x*‖²) / (2α^{(s,t_s)}) + (α^{(s,t_s)}/2) K². (9)

In formula (9), the first inequality follows from the convexity of the function and the second from the Lipschitz continuity condition. By accumulating formula (9) over the epochs, the result (10) is obtained. Multiplying both sides of formula (10) by K² yields formula (11), and bounding the right-hand side gives formula (12). Under the condition of formula (12), L_D^{(s,t_s)} converges to the optimal value L_D* as the number of epochs grows (formula (13)). In this paper, the data are high dimensional, and we use the Wasserstein distance to measure the distance between the generated data and the real data. Suppose W_k represents the Wasserstein distance between the generated data and the real data when the discriminator converges at the k-th iteration; then formula (13) is transformed accordingly (formula (14)).

Calculation of Discriminator Training Epochs.
Under the conditions of formulas (12) and (14), the number of discriminator training epochs in each iteration is

epoch_{k+1} = ⌊ L_D^{(k−1)} / L_D^{(k)} ⌋, (15)

where ⌊·⌋ rounds the value down, and L_D^{(k−1)} and L_D^{(k)} are the minimum discriminator losses at convergence in the previous two iterations. That is to say, the number of discriminator training epochs in each iteration depends on the ratio of the minimum losses of the previous two convergent iterations of discriminator training, rounded down. If the ratio rounds down to 0, the discriminator is trained for 1 epoch.
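The alternate-training schedule with this dynamic epoch rule can be sketched as follows. This is a minimal, hypothetical illustration: the two callables stand in for one real training epoch of the generator and discriminator, each returning a loss value, and the last loss of an iteration is taken as its converged minimum.

```python
import math

def train_cwgan(generator_epoch, discriminator_epoch, n_iterations):
    """Sketch of the CWGAN schedule: in each iteration, train the
    generator, then train the discriminator for d_epochs epochs; the
    next d_epochs is the floored ratio of the converged losses of the
    previous two iterations (at least 1)."""
    d_epochs, prev_loss, history = 1, None, []
    for _ in range(n_iterations):
        g_loss = generator_epoch()                # stage 1: fix D, train G
        for _ in range(d_epochs):                 # stage 2: fix G, train D
            d_loss = discriminator_epoch()
        if prev_loss is not None:                 # rule of formula (15)
            d_epochs = max(1, math.floor(prev_loss / d_loss))
        prev_loss = d_loss
        history.append((g_loss, d_loss, d_epochs))
    return history
```

As the discriminator loss shrinks between iterations, the ratio grows, so the discriminator automatically receives more epochs exactly when it is converging fastest.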
Moreover, since there are many math symbols in this section, we summarize all of them in Table 1.

Shallow Machine Learning Classifier.
Shallow machine learning has good performance and high efficiency in learning and analyzing data [36]. There are many shallow machine learning classifiers, e.g., NB, RF, and LR. Based on our previous experimental results and an analysis of the existing literature, we find that SVM has several advantages, such as stable classification performance, the ability to deal with noise and outlier data, and effectiveness in handling nonlinear and high-dimensional data. Therefore, we choose SVM as the classifier in our experiments.
Shallow machine learning consists of two stages: training stage and testing stage. In the training stage, the generated samples based on CWGAN combined with the original training data set are input to the shallow machine learning classifier to train and obtain the optimal model structure. In the testing stage, the imbalanced testing data set is input to the trained shallow machine learning model to get the labels of the predicted testing data.
In the experiments, the true labels of the testing data set are known, so the performance of the shallow machine learning models, such as accuracy, precision, recall, and F1, can be obtained by comparing the true labels with the predicted labels and calculating the confusion matrix. In this work, the accuracy of the majority class in the testing data set (acc+), the accuracy of the minority class in the testing data set (acc−), and the accuracy, precision, recall, F1, and G-means of the whole testing data set are computed to evaluate the performance. The confusion matrix for binary classification includes four items: true positive (TP), false negative (FN), false positive (FP), and true negative (TN). The accuracy acc− is computed on the minority class:

acc− = TP / (TP + FN). (16)

The accuracy acc+ is computed on the majority class:

acc+ = TN / (TN + FP). (17)

Accuracy, precision, recall, F1, and G-means of the whole testing data set are computed as

Accuracy = (TP + TN) / (TP + FN + FP + TN), (18)
Precision = TP / (TP + FP), (19)
Recall = TP / (TP + FN), (20)
F1 = 2 × Precision × Recall / (Precision + Recall), (21)
G-means = √(acc− × acc+). (22)
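These metrics can be computed directly from the four confusion matrix counts, taking the minority (malicious) class as positive. A minimal sketch with an illustrative function name:

```python
import math

def evaluate(tp, fn, fp, tn):
    """Evaluation metrics from the binary confusion matrix, with the
    minority class as the positive class."""
    acc_minus = tp / (tp + fn)                 # acc-: minority-class accuracy (= recall)
    acc_plus = tn / (tn + fp)                  # acc+: majority-class accuracy
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * acc_minus / (precision + acc_minus)
    g_means = math.sqrt(acc_minus * acc_plus)  # geometric mean of acc- and acc+
    return {"acc+": acc_plus, "acc-": acc_minus, "accuracy": accuracy,
            "precision": precision, "recall": acc_minus,
            "f1": f1, "g_means": g_means}
```

Note that G-means collapses whenever acc− does, which is why it tracks minority-class performance even when the overall accuracy stays high on imbalanced data.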

Data Set.
The data sets are downloaded from the website of the Canadian Institute for Cybersecurity, which is based at the University of New Brunswick in Fredericton [37]. Many security data sets collected on the website are used by companies, research centers, and universities. In our experiment, we download the URL data set, which contains several different types of URLs, and 79 features are selected from the original data, such as URL, domain, path, file name, and argument [38]. Then, we extract samples from the data set to form several imbalanced data sets, ranked from low to high by IR. The details of the data sets are shown in Table 2.
Benign URLs: they are collected from Alexa top websites. First, the domains are crawled by a Heritrix web crawler; then, around half a million unique URLs are extracted from the domains and filtered by VirusTotal. Malware URLs: they link to malicious websites and come from DNS-BH, which maintains a list of malicious sites. Defacement URLs: they are fraudulent or hidden URLs ranked by Alexa and linked to malicious Web pages. Spam URLs: they are collected from the publicly available WEBSPAM-UK2007 data set.
In the experiments, the original data set is divided into the training set and the testing set according to the ratio of 7 : 3. The training set is used to train the model, and the testing set is used to validate and evaluate the model.
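The 7 : 3 split can be sketched as follows; the fixed random seed is an assumption for reproducibility (the paper does not state one), and the function name is illustrative:

```python
import numpy as np

def split_7_3(X, y, seed=0):
    """Shuffle the data set and split it 70% / 30% into training and
    testing sets (seed is an assumed choice for reproducibility)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(0.7 * len(X))
    return X[idx[:cut]], y[idx[:cut]], X[idx[cut:]], y[idx[cut:]]
```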
All experiments are performed in JetBrains PyCharm 2017 with the Python 3.6 interpreter on a laptop with an Intel Core i5-6200U 2.3 GHz CPU and 8 GB RAM running Windows 10.

Performance of Original Imbalanced Data Set.
In the first part of the experiments, the performance of the original imbalanced data sets is computed. First, the original imbalanced data sets are divided proportionally into training data sets and testing data sets, respectively; both are imbalanced. Second, the training data set is input to SVM, a traditional and effective classifier, to train the model. Third, the testing data set is input to the trained SVM to predict the labels. At the same time, the accuracy of the majority class in the testing data set (acc+), the accuracy of the minority class in the testing data set (acc−), and the accuracy, precision, recall, F1, and G-means of the whole testing data set are computed to evaluate the performance, as shown in Table 3. In addition, it should be noted that the recorded results are the averages of 5 experiments.
According to the performance of the original imbalanced data sets in Table 3, we can find some interesting results. First, with the increase of IR, the acc+ values representing the accuracy of the majority class increase, while the acc− values representing the accuracy of the minority class decrease. These results are consistent with the conclusion that classifier boundaries tend toward the majority class, leading to classification errors on the minority class. Second, with the increase of IR, the accuracy, precision, recall, and F1 of the whole testing data sets increase, whereas the G-means decreases. That is because accuracy, precision, recall, and F1 represent the overall performance of the testing data sets, and the accuracy of the majority class can ensure the overall accuracy. The trend of acc− is the same as that of G-means: both decrease as IR increases. So, G-means can represent the performance of the minority class. Furthermore, we can obtain

Table 1: The symbols used in formulas (1) to (15) in Section 3.3.

No.  Math symbol   Meaning
1    X             The training set
2    N             The size of the training set X
3    X_d           The majority class of the training set
4    N+            The size of the majority class in the training set
5    X_s           The minority class of the training set
6    N−            The size of the minority class in the training set
7    IR            The imbalance ratio of the imbalanced data sets
8    R             The oversampling rate of the minority class
9    X_f           The false data generated by the generator of GAN
10   P_G           The output data of the generator
11   P_d           The output data of the discriminator
12   x*            The optimal solution of L_D^{(s,t_s)}(f_w(x))
13   W_k           The Wasserstein distance between the generated data and the real data when the discriminator converges at the k-th iteration
some conclusions. On one hand, the overall performance of the whole data sets cannot reflect the performance of the imbalanced data. On the other hand, when the IR is very high, the accuracy of imbalanced data is very low.
In the field of cyberspace security, network threat data are usually viewed as the minority class. If the minority data are misclassified and their detection accuracy is low, the consequences can be serious. Therefore, it is necessary to solve the problem of the imbalanced data set, which is the goal of our work. In the following parts, we introduce the results of our proposed method, which generates sufficient new minority samples to solve the imbalance problem and improve the accuracy of the minority class.

Comparison between WGAN and CWGAN.
WGAN can generate new data based on the original data set to solve the problem of the imbalanced data set. In this work, we propose CWGAN, which improves the training stability and thereby the performance of WGAN. In this part, the performances of WGAN and CWGAN are computed and compared. First, the original imbalanced data sets are divided proportionally into training data sets and testing data sets. The number of new minority samples needed for a training data set equals the number of majority-class samples minus the number of minority-class samples. Second, WGAN and CWGAN are used to generate the new minority samples for the training data sets. Then, the new minority samples and the original training data sets are fused to form balanced training data sets. Tables 4 and 5 are computed based on the output of the SVM trained on the balanced training data sets, which are fused with the new minority samples generated by WGAN and CWGAN, respectively. On the whole, comparing the results in Tables 3-5, we can conclude that acc− and G-means in Tables 4 and 5 increase obviously compared with those in Table 3, while acc+, accuracy, precision, recall, and F1 in Tables 4 and 5 decrease a little compared with those in Table 3. That is because WGAN and CWGAN generate new minority samples that balance the original data sets, so the trained SVM achieves a model that focuses more on the minority class. Comparing the results in Tables 4 and 5, acc− and G-means in Table 5 (based on CWGAN) are higher than those in Table 4 (based on WGAN), especially when IR is high; acc+, accuracy, precision, recall, and F1 of some data sets in Table 5 are also a little higher than those in Table 4. Therefore, we can conclude that the performance achieved by generating new minority samples with CWGAN outperforms that achieved with WGAN.
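The evaluation pipeline described above (split the data, oversample the minority class up to parity with the majority class, train an SVM on the balanced set, then score the per-class accuracies) can be sketched as follows. This is only a minimal illustration on synthetic data: the trained WGAN/CWGAN generator is stood in for by a trivial jitter-based sampler, which is purely a placeholder, and scikit-learn's `SVC` plays the role of the SVM classifier:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 200 majority samples (label 0), 20 minority (label 1).
X_maj = rng.normal(0.0, 1.0, size=(200, 2))
X_min = rng.normal(3.0, 1.0, size=(20, 2))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 200 + [1] * 20)

# Proportional (stratified) split into training and testing data sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Needed new minority samples = |majority| - |minority| in the training set.
n_new = int((y_tr == 0).sum() - (y_tr == 1).sum())

# Placeholder for the trained generator: jitter real minority samples.
base = X_tr[y_tr == 1]
idx = rng.integers(0, len(base), size=n_new)
X_fake = base[idx] + rng.normal(0.0, 0.1, size=(n_new, 2))

# Fuse generated samples with the original training set to balance it.
X_bal = np.vstack([X_tr, X_fake])
y_bal = np.concatenate([y_tr, np.ones(n_new, dtype=int)])

# Train the SVM on the balanced set and score per-class accuracies.
clf = SVC(kernel="rbf").fit(X_bal, y_bal)
y_pred = clf.predict(X_te)
acc_plus = ((y_pred == 0) & (y_te == 0)).sum() / (y_te == 0).sum()
acc_minus = ((y_pred == 1) & (y_te == 1)).sum() / (y_te == 1).sum()
g_means = (acc_plus * acc_minus) ** 0.5
```

In a real run, `X_fake` would instead come from sampling the trained WGAN or CWGAN generator; everything else in the pipeline is unchanged.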

Discriminator Loss of WGAN and CWGAN.
During the training stage, the discriminator loss is computed based on the cross-entropy loss function to compare the probability that the predicted labels of the generated data are close to those of the real data. The smaller the discriminator loss, the closer the generated data are to the real data. In this section, we study the trend of the discriminator loss of WGAN and CWGAN on different data sets. The epoch of WGAN in each iteration is fixed to 5. The epoch of CWGAN in each iteration is computed when the model is trained to convergence in that iteration. The numbers of training iterations of WGAN and CWGAN are both 1000. The loss of each iteration in one experiment is recorded and plotted in Figure 2. Figures 2(a)-2(h) show the discriminator loss curves of WGAN and CWGAN on D1 to D8. These curves show that the discriminator loss of CWGAN is smaller and more stable than that of WGAN as the iterations increase, and the values of the discriminator loss of CWGAN are very close to 0. The discriminator loss measures the similarity between the generated data and the real data. The results reflect that the generated data based on CWGAN are closer to the real data than those based on WGAN. Therefore, combined with the experimental results in Section 4.2.2, we can confirm that the performance of CWGAN outperforms that of WGAN.

Training Convergence of CWGAN.
The training of WGAN with a generator and a discriminator is imbalanced. To make the training process tend toward a Nash equilibrium, we fix the training epoch of the generator and dynamically compute the training epoch of the discriminator according to the convergence of the training loss. In each iteration, the value of the training epoch of the discriminator is determined by the ratio of the training losses in the last two iterations. In this part, we compute the training epoch of the discriminator in each iteration on different data sets. The results are plotted in Figure 3. In addition, it should be noted that the number of training iterations of CWGAN is 1000. The recorded results are from one of several experiments. Figure 3 shows the training epoch of the discriminator in each iteration on different data sets. We divide the training process into two stages. In the early stage, the values of the training epoch change frequently and are unstable. In the later stage, the values of the training epoch are stable. In addition, in most cases, when the training process converges, the higher the IR, the more samples are generated by CWGAN, and the higher the training epoch. For example, the training epoch of D8, whose IR is the highest, is the highest.
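A minimal sketch of this "train the discriminator until its loss converges within the current iteration" rule is shown below. The exact stopping test (relative change between the losses of the last two epochs falling below a tolerance) and the epoch cap are assumptions made for illustration; the paper determines the epoch from the ratio of the training losses in the last two iterations, which this mirrors:

```python
def train_discriminator_to_convergence(step_fn, tol=1e-2, max_epochs=50):
    """Run discriminator epochs until the loss converges.

    step_fn() runs one discriminator training epoch and returns its loss.
    Epochs are repeated until the relative loss change between the last
    two epochs falls below tol (the discriminator has converged for this
    outer iteration), or until max_epochs is reached. Returns the number
    of epochs actually used, which is the dynamic epoch plotted per
    iteration in the experiments.
    """
    prev = step_fn()
    for epoch in range(2, max_epochs + 1):
        curr = step_fn()
        if abs(curr - prev) / max(abs(prev), 1e-12) < tol:
            return epoch
        prev = curr
    return max_epochs
```

This explains the two stages seen in Figure 3: early on, the loss ratio between consecutive epochs is large and erratic, so the returned epoch count fluctuates; later, the discriminator converges quickly and the epoch count stabilizes.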

Comparison between the Oversampling Methods.
At present, there are many oversampling methods, such as VAE and SMOTE. In this section, we compare the performance of these oversampling methods. The experimental process is similar to that in Section 4.2.2. First, the oversampling methods, including VAE, SMOTE, WGAN, and CWGAN, are applied to generate minority examples, which are combined with the original training data sets to train the SVM. Second, the testing data sets are input to the trained SVM to predict the labels. At the same time, acc+, acc−, accuracy, and G-means are computed to evaluate the performances, which are plotted in Figure 4. In addition, it should be noted that the numbers of training iterations of WGAN and CWGAN are both 1000. The recorded results are the average of 5 experiments. In Figure 4(a), the accuracy of the minority data acc− based on CWGAN is the highest on the same data sets. Especially on D6, D7, and D8, whose IR is higher, the accuracy on most data sets improves the most after CWGAN generates minority samples to balance the imbalanced data sets. In Figure 4(b), the comparison on G-means is similar to that on acc−: the G-means on most data sets based on CWGAN is the highest. We can conclude that the new minority samples generated by CWGAN improve the performance of the minority class the most. In Figures 4(c) and 4(d), in most cases, the accuracy of the majority class acc+ and the accuracy of the whole data are close to those of the original data, which means that CWGAN does not weaken the accuracy of the majority class or of the whole data.
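As background for the SMOTE baseline compared above, the core of SMOTE is linear interpolation between each minority example and one of its k nearest minority-class neighbours. The following NumPy sketch is a simplified stand-in for that idea, not the reference implementation from [12]; `smote_like` is a hypothetical function name:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples, SMOTE-style.

    For each synthetic sample: pick a random minority point, pick one of
    its k nearest minority neighbours, and interpolate at a random
    position on the line segment between them.
    """
    rng = np.random.default_rng(seed)

    # Pairwise distances within the minority class; a point is never
    # its own neighbour.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    k = min(k, len(X_min) - 1)
    neigh = np.argsort(d, axis=1)[:, :k]   # k nearest neighbours per point

    out = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(len(X_min))       # random minority point
        b = neigh[a, rng.integers(k)]      # one of its k neighbours
        lam = rng.random()                 # interpolation coefficient
        out[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return out
```

Because every synthetic point lies on a segment between two real minority points, SMOTE cannot extrapolate beyond the convex hull of the minority class; the generative approaches (VAE, WGAN, CWGAN) learn a distribution instead, which is the motivation for comparing them here.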

Conclusion
WGAN, as a generative model, can solve the imbalanced data problem by generating new samples for the minority class. However, it may suffer from mode collapse, and its training sometimes fails to converge. In this paper, we improved the training process to realize the training stability of WGAN by proposing a convergent WGAN-based oversampling model called CWGAN. In each iteration of the training process, the epoch of the generator is fixed, but the epoch of the discriminator is dynamic, determined by the convergence of the discriminator loss function over the last two iterations. The experimental results showed that CWGAN not only improves the training stability of WGAN, making the loss smoother and closer to 0, but also improves the performance of the minority class through oversampling while ensuring the accuracy of the majority class and of the whole data. Compared with other oversampling methods, CWGAN performs better and enables stable training.
Data Availability

The data in the experiments are downloaded from public websites. The links are cited in [37]. In addition, the data sets are also available from the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.