Deep Learning-Based Imbalanced Classification With Fuzzy Support Vector Machine

Wang, Ke-Fan; An, Jing; Wei, Zhen; Cui, Can; Ma, Xiang-Hua; Ma, Chao; Bao, Han-Qiu

doi:10.3389/fbioe.2021.802712

ORIGINAL RESEARCH article

Front. Bioeng. Biotechnol., 21 January 2022
Sec. Bionics and Biomimetics
Volume 9 - 2021 | https://doi.org/10.3389/fbioe.2021.802712

Ke-Fan Wang¹

Jing An¹

Zhen Wei²*

Can Cui³

Xiang-Hua Ma¹

Chao Ma¹

Han-Qiu Bao³

¹School of Electrical and Electronic Engineering, Shanghai Institute of Technology, Shanghai, China
²School of Design, East China Normal University, Shanghai, China
³College of Electronic and Information Engineering, Tongji University, Shanghai, China

Imbalanced classification is widespread in the fields of medical diagnosis, biomedicine, smart city and Internet of Things. The imbalance of data distribution makes traditional classification methods more biased towards majority classes and ignores the importance of minority class. It makes the traditional classification methods ineffective in imbalanced classification. In this paper, a novel imbalance classification method based on deep learning and fuzzy support vector machine is proposed and named as DFSVM. DFSVM first uses a deep neural network to obtain an embedding representation of the data. This deep neural network is trained by using triplet loss to enhance similarities within classes and differences between classes. To alleviate the effects of imbalanced data distribution, oversampling is performed in the embedding space of the data. In this paper, we use an oversampling method based on feature and center distance, which can obtain more diverse new samples and prevent overfitting. To enhance the impact of minority class, we use a fuzzy support vector machine (FSVM) based on cost-sensitive learning as the final classifier. FSVM assigns a higher misclassification cost to minority class samples to improve the classification quality. Experiments were performed on multiple biological datasets and real-world datasets. The experimental results show that DFSVM has achieved promising classification performance.

Introduction

In many fields, the distribution of data is imbalanced: one class is much larger than the other. For example, in disease diagnosis (Bhattacharya et al., 2017; Loey et al., 2020), most of the data come from healthy subjects, and data on diseases are difficult to obtain. Moreover, with the deployment of various monitoring systems, more and more data are collected in smart cities and the Internet of Things, but most of these data describe normal operation and abnormal data are rare (Du et al., 2019; Fathy et al., 2020). The larger and the smaller class are usually referred to as the majority and minority class, respectively (Tao et al., 2020). Majority class samples are easily available, while minority class samples are harder to obtain because of their natural frequency of occurrence or the difficulty of data collection. Imbalanced data distributions also exist in the fields of fraud detection (Li and Wong, 2015; Jiang et al., 2018), computer security (Wang and Yao, 2013), intrusion detection (Yao et al., 2018), drift detection (Wang et al., 2021), image recognition (Romani et al., 2018) and defect detection (Li et al., 2020). In machine learning, there are many well-established classification methods, such as decision tree, logistic regression, support vector machine and extreme learning machine (Shi et al., 2022), but they assume a roughly balanced class distribution and take overall accuracy as the optimization goal. When traditional classification methods are applied to imbalanced classification, the results favor the majority class and ignore the importance of the minority class. Although the overall accuracy is relatively high, the minority class data, which carry important information, cannot be accurately identified.

Many imbalance classification algorithms have been proposed in recent decades. These algorithms can be generally divided into two main types: data-level and algorithm-level (Tao et al., 2020). The data-level approaches first bring the original imbalanced dataset to balanced distribution by some sampling processing, and then classify it by using a traditional classifier. The algorithm-level approaches attempt to improve existing classification algorithms by reducing their bias for the majority class data, and thus adapt to imbalanced data distribution.

In this paper, a novel imbalance classification method based on deep feature representation is proposed, named DFSVM. First, from the perspective of data features, a deep neural network is used to obtain the embedding space features. An appropriate feature representation improves the classification performance of models: it increases the separation between features of different classes and the similarity of features within the same class, and it provides a basis for the effective recognition of samples. The deep neural network has a complex nonlinear structure, which can effectively extract the deep features of samples. When training the network, a triplet loss function (Schroff et al., 2015) is used to enable the network to separate the features of the minority class and the majority class. Additionally, the Gumbel distribution function (Cooray, 2010) is applied as the activation function in the activation layer. This function is continuously differentiable, so it can be used directly as an activation function in neural networks optimized by stochastic gradient descent. The original input samples are mapped to the same embedding space after feature extraction. In the embedding space, new minority class samples are randomly generated and filtered based on the distance between each sample and the class centers, which makes the data distribution balanced. After obtaining the embedding features of samples, a fuzzy support vector machine (FSVM) (Lin and Wang, 2002) is used as the classifier. FSVM introduces membership values (MVs) into the objective function of the traditional support vector machine and sets different misclassification costs for samples of different classes. The misclassification cost of the minority class is higher than that of the majority class. FSVM is thus a cost-sensitive learning strategy that can effectively improve the recognition rate of the minority class samples.
In addition, traditional classification methods use accuracy as the evaluation metric, but a classifier tuned for accuracy tends to perform poorly on the minority class, because accuracy gives minority class samples little weight in the overall score. Therefore, this paper uses G-mean, F-measure and AUC values to evaluate the classification results more comprehensively.

The rest of this paper is organized as follows. The Related Work section reviews related work on imbalance classification. The Proposed Method section describes DFSVM. The Experiments and Results section and the Conclusion section present the experimental results and conclusions.

Related Work

The imbalance of data distribution and the limitations of traditional classification algorithms are the main problems that imbalanced classification faces. Research on imbalanced classification can therefore be divided into two levels: data-level and algorithm-level.

Data-Level

Data resampling is the most representative data-level method; it reduces the imbalanced ratio (IR) by changing the data distribution. Undersampling reduces the bias of the model towards the majority class by reducing the number of majority class samples. Random undersampling is the simplest approach: it randomly selects and removes part of the majority class samples. However, random undersampling easily deletes potentially useful information, so some heuristic methods have been proposed.

Neighborhood cleanup rule (NCL) (Laurikkala, 2001) uses an instance-based approach to reduce the larger classes and carefully considers the quality of the data to be removed. To reduce the impact of noisy minority examples on the performance of classifiers, Kang et al. (2016) proposed a new undersampling algorithm that introduces a noise filter. The weighted under-sampling of SVM (WU-SVM) groups majority samples into subregions and assigns different weights based on their Euclidean distance to the hyperplane, so as to retain the data distribution information of the original dataset (Kang et al., 2017). The other popular sampling method is oversampling, which balances the data distribution by increasing the number of minority class samples. Random oversampling can cause overfitting, so heuristic methods are also mostly used. The most representative one is the synthetic minority oversampling technique (SMOTE) (Chawla et al., 2002). SMOTE generates a new minority sample by interpolating between a minority sample and one of its k nearest minority neighbors. However, due to irregular data distributions, new samples generated by SMOTE may become noise, which may increase the overlap between classes and lead to misclassification. In order to generate more reasonable samples, some variants of SMOTE have been proposed, such as Borderline-SMOTE (B-SMOTE) (Han et al., 2005) and the adaptive synthetic sampling approach (ADASYN) (He et al., 2008). The kernel-based SMOTE (KSMOTE) algorithm synthesizes minority data points directly in the feature space of the SVM classifier and adds new data points by augmenting the original Gram matrix based on neighborhood information in the feature space (Mathew et al., 2015). Weighted kernel-based SMOTE (WK-SMOTE) overcomes the limitations of SMOTE for nonlinear problems by oversampling in the feature space and using a cost-sensitive support vector machine (Mathew et al., 2018).

Algorithm-Level

Traditional classification methods tend to favor the majority class and ignore minority class samples when dealing with imbalanced data. To overcome this shortcoming, researchers have made improvements to the algorithms themselves. Typical improvements are cost-sensitive and ensemble learning methods. Fuzzy support vector machine (FSVM) (Lin and Wang, 2002) is a cost-sensitive algorithm. It introduces the fuzzy membership value (MV) of each sample into the objective function of the support vector machine (SVM) to distinguish the importance of different samples. FSVM-CIL is an improved FSVM for class imbalance learning that can deal with class imbalance in the presence of outliers and noise; its membership calculation is based on distances in the original data space (Batuwita and Palade, 2010). Yu et al. (2019) proposed two relative density-based FSVMs, namely FSVM-WD, based on within-class relative density, and FSVM-BD, based on between-class relative density. Both use a similar strategy to calculate the relative density of each training sample based on K-nearest neighbor probability density estimation (KNN-PDE). ACFSVM is an FSVM method based on affinity and class probability, which calculates the affinity of majority class samples based on the support vector data description (SVDD) model, and then identifies possible outliers and some border samples in the majority class (Tao et al., 2020). The basic idea of the ensemble approaches is to combine standard ensemble learning algorithms with existing imbalanced data classification methods, as in SMOTEBagging (Wang and Yao, 2009) and SMOTEBoost (Chawla et al., 2003). However, training the base classifiers in ensemble learning is more complicated, ensemble methods have limitations in handling high-dimensional data, and it is difficult to choose the type and number of base classifiers.

Proposed Method

The DFSVM method proposed in this paper uses a fuzzy support vector machine as the base classifier and uses a data sampling method to obtain a balanced data distribution. The new samples generated by oversampling still belong to the minority class, and the use of FSVM further increases the model's focus on the minority class. In addition, deep neural networks are used to obtain more discriminative feature information, which makes subsequent classification easier.

Feature Extraction With Deep Learning

With the significant increase in computing power and the explosive growth in the amount of data, deep learning has attracted a lot of attention in academia and industry in recent years. Deep neural networks (DNNs) have significantly improved the best recognition rates on many problems by increasing the network depth or changing the structure of the model (Krizhevsky et al., 2012; He et al., 2016). Deep learning implementations rely on deep neural networks, which involve a cascade of multiple layers of nonlinear processing units for feature extraction and transformation. For example, convolutional neural networks (CNNs) are one such deep learning architecture that has achieved breakthrough performance gains in image classification (Kang et al., 2021). Feature representation is critical to classification performance, so this paper applies the classification method in the embedding space obtained after feature extraction.

In this paper, a deep neural network (DNN) is used as the feature extractor because it can learn high-level feature representations from samples (Ng et al., 2016). Once training is complete, the hidden feature representations can be used as embedding features to reveal interesting structures in the data. To increase the separation between features of different classes and reduce the variation among features of samples in the same class, a triplet loss (Schroff et al., 2015) is used to train the network model; it brings samples in the same class closer together and pushes samples in different classes further apart. Each sample can then be mapped into a discriminative feature space by the trained model. The triplet loss is defined relative to anchor points, which allows the embedding space to be arbitrarily distorted. It is defined as:

\begin{matrix} L_{triplet} = {(D_{a, \min} - D_{a, maj} + m)}_{+} \end{matrix} (1)

where m is the margin, set to 0.2 in the experiments, D is the distance function, a is the anchor point belonging to the minority class, min denotes a minority class sample, and maj denotes a majority class sample. ${(\cdot)}_{+}$ indicates that the value is taken as the loss if it is greater than 0; otherwise, the loss is 0. The smaller the margin, the more easily the triplet loss converges to 0, because the anchor point and the minority class samples do not need to be pulled very close together, and the anchor point and the majority class samples do not need to be pushed very far apart. However, a small margin cannot distinguish similar samples well. When the margin is large, the distance between the anchor point and the minority class samples must be very small, and the distance between the anchor point and the majority class samples must be very large. If the margin is set too large, the final loss is likely to remain at a large value and be difficult to reduce to 0, but the resulting model can distinguish similar samples with more certainty.
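Equation 1 can be illustrated for a single triplet with a few lines of Python. This is only a minimal sketch: the Euclidean distance is an assumption (the text does not state which distance function D is used), and the function names and the toy points are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed choice of D)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, minority, majority, margin=0.2):
    """Eq. 1: (D(a, min) - D(a, maj) + m)_+ for one triplet."""
    return max(euclidean(anchor, minority) - euclidean(anchor, majority) + margin, 0.0)

anchor = [0.0, 0.0]     # minority class anchor
minority = [0.1, 0.0]   # same class as the anchor, already close
majority = [1.0, 0.0]   # other class, already far away

# Margin satisfied: the hinge clips the loss to zero.
loss_ok = triplet_loss(anchor, minority, majority)
# Roles swapped: the "minority" point is far and the "majority" point is near,
# so the loss is positive and training would push the embedding to fix it.
loss_bad = triplet_loss(anchor, majority, minority)
print(loss_ok, loss_bad > 0)
```

With a larger margin, the first triplet would also start to contribute a positive loss, which matches the trade-off described above.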

Figure 1 shows the results and geometric significance of optimization using the triplet loss. The triplet loss tries to learn an embedding space in which the anchor is closer to the minority class samples and further away from the majority class samples. The deep neural network model with the triplet loss as the training criterion not only retains the simplicity of metric learning, but also has the excellent nonlinear modeling capabilities of neural networks, which can greatly simplify and control the training process. When two inputs are similar, the triplet loss can learn a better representation for the two input vectors with small differences, and thus perform well in the classification task.

FIGURE 1. Optimization result using the triplet loss function.

The Gumbel distribution (Cooray, 2010) is used as the activation function in the DNN. The Gumbel distribution, also known as the Generalized Extreme Value (GEV) distribution type I, is widely used to model the distribution of extreme values of samples drawn from various distributions. Its cumulative distribution function (CDF) is defined as:

\begin{matrix} σ (x) = e^{- e^{- x}} \end{matrix} (2)

When compared to the Gumbel distribution, the ReLU activation function shows some drawbacks for the class imbalance problem: it tends to underestimate the probability of minority nodes when dealing with the issue of class imbalance. Relative to the ReLU activation function, the Gumbel distribution function is not affected by the dying ReLU problem. Moreover, the Gumbel distribution is asymmetric, so that different penalties can be applied to the misclassification of both classes. In addition, the Gumbel distribution function is continuously differentiable, so it can be easily used as an activation function with optimization in a neural network. Finally, the whole DNN framework used for feature extraction is shown in Figure 2. The network used for feature extraction consists of three hidden layers, and we set the number of neurons in each layer to have the following relationship: the number of neurons in the next layer is half of the number of neurons in the previous layer. In the later experiments, we only set the number of neurons in the third layer, i.e., the dimension of the final embedding space, and the first two layers will make the corresponding changes according to the above rules. Figure 3 shows two t-SNE plots of the original Glass1 dataset and the dataset after the network training. It can be seen that after training, the different classes become easier to distinguish.
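The activation of Equation 2 and the layer-halving rule can be sketched as follows. The function names are hypothetical, and the derivative is obtained by differentiating Equation 2 directly rather than taken from the paper. No overflow guard is included, so very negative inputs would overflow `math.exp`.

```python
import math

def gumbel(x):
    """Eq. 2: Gumbel CDF used as an activation, sigma(x) = exp(-exp(-x))."""
    return math.exp(-math.exp(-x))

def gumbel_grad(x):
    """Derivative of Eq. 2: sigma'(x) = exp(-x) * exp(-exp(-x))."""
    return math.exp(-x) * gumbel(x)

def hidden_sizes(embedding_dim):
    """Three hidden layers, each with half as many neurons as the previous one."""
    return [4 * embedding_dim, 2 * embedding_dim, embedding_dim]

print(gumbel(0.0))        # exp(-1), not 0.5: the curve is asymmetric around 0
print(gumbel_grad(-2.0))  # strictly positive, so units do not "die" as with ReLU
print(hidden_sizes(8))    # [32, 16, 8]
```

Only the third-layer width is a free choice here; the first two widths follow from it, as described in the text.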

FIGURE 2. Deep neural network framework for feature extraction.

FIGURE 3. t-SNE plots of the original Glass1 dataset and after the network training.

Random Feature Oversampling Based on Center Distance

After obtaining the embedding space representation of samples, the data distribution is still imbalanced. The dataset in the embedding space is $X = \{x_{1}, x_{2}, \dots, x_{n}\}$, where $n$ is the total number of samples, $x_{i} = [f_{i}^{1}, f_{i}^{2}, \dots, f_{i}^{p}] \in ℝ^{p}$, $i \in \{1, 2, \dots, n\}$, and $f_{i}^{j}$ is the value of the sample $x_{i}$ on the $j$-th dimension feature, $j \in \{1, 2, \dots, p\}$. For the minority class samples, the set of features in each dimension is denoted as $F = \{F^{1}, F^{2}, \dots, F^{p}\}$, where $F^{j} = \{f_{1}^{j}, f_{2}^{j}, \dots, f_{n_{min}}^{j}\}$, $j \in \{1, 2, \dots, p\}$, and $n_{min}$ is the number of the minority class samples. $F^{j}$ is the set of values of all minority class samples on the $j$-th dimension feature. The feature of each dimension of the new synthetic sample is randomly selected from the corresponding feature set, $x_{syn} = [f_{syn}^{1} \in F^{1}, f_{syn}^{2} \in F^{2}, \dots, f_{syn}^{p} \in F^{p}]$.

This method of randomly generating features can increase the diversity of the minority class samples and avoid overfitting. However, it also generates some outliers and noise, so a constraint based on class center distance is used to filter the synthetic samples. As shown in Figure 4, in the embedding space, the center of the majority class is $C_{maj}$, the center of the minority class is $C_{min}$, and the center of the whole data is $C_{all}$. The distance between each center and the synthetic sample is calculated to determine whether the following condition is satisfied:

\begin{matrix} d (x_{syn}, C_{maj}) > d (x_{syn}, C_{all}) > d (x_{syn}, C_{\min}) \end{matrix} (3)

where $d (\cdot)$ is the distance function. If the synthesized sample fits this condition, it will be kept, otherwise, it will be deleted. In this paper, the influence of irregular data distribution is avoided by calculating the class centers in the embedding space. The number of synthesized samples is set to achieve balanced data distribution.
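The sampling and filtering steps described in this subsection can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: class centers are taken as component-wise means, the distance is Euclidean, and the `max_tries` cap and all function names are hypothetical additions.

```python
import math
import random

def center(samples):
    """Component-wise mean of a list of equal-length vectors (assumed class center)."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]

def dist(u, v):
    """Euclidean distance (assumed choice of d)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def oversample(minority, majority, n_new, rng, max_tries=10000):
    """Draw every feature of a candidate from the minority values on that
    dimension (the sets F^1..F^p) and keep the candidate only if Eq. 3 holds."""
    c_min, c_maj = center(minority), center(majority)
    c_all = center(minority + majority)
    feature_sets = list(zip(*minority))   # F^1, ..., F^p
    kept = []
    for _ in range(max_tries):            # hypothetical cap to guarantee termination
        if len(kept) == n_new:
            break
        x_syn = [rng.choice(values) for values in feature_sets]
        if dist(x_syn, c_maj) > dist(x_syn, c_all) > dist(x_syn, c_min):
            kept.append(x_syn)
    return kept

minority = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3]]
majority = [[2.0, 2.0], [2.2, 1.9], [1.8, 2.1], [2.1, 2.3], [1.9, 2.0], [2.0, 1.8]]
synthetic = oversample(minority, majority,
                       n_new=len(majority) - len(minority),
                       rng=random.Random(0))
print(len(synthetic))   # enough new samples to balance the two classes
```

Because features are mixed across samples, a synthetic point need not coincide with any original minority sample, which is the source of the added diversity.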

FIGURE 4. Validation of the new synthetic feature vector.

Fuzzy Support Vector Machine

In many real-life applications, each sample has a different level of importance. For imbalanced data problems, the minority class samples are often more important than the majority class samples. In order to improve the classification performance, each sample needs to be assigned to a corresponding weight according to its importance. In this paper, a fuzzy support vector machine (FSVM) (Lin and Wang, 2002) is used as the classifier to achieve the assignment of different weights.

The data after sampling are denoted as $X = \{x_{1}, x_{2}, \dots, x_{n}\}$, where $n$ is the total number of samples including all synthetic samples, $x_{i} \in ℝ^{p}$, $i \in \{1, 2, \dots, n\}$, and $p$ is the feature dimension. The dataset is $D = \{(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})\}$, where $y_{i} \in \{1, - 1\}$ is the label of the corresponding sample. FSVM adds an attribute to each sample to expand the original dataset to $D = \{(x_{1}, y_{1}, s_{1}), (x_{2}, y_{2}, s_{2}), \dots, (x_{n}, y_{n}, s_{n})\}$, where $s_{i}$ represents the fuzzy membership value (MV) of sample $x_{i}$. The greater the value of $s_{i}$, the greater the importance of the sample. In this way, the optimization problem of FSVM can be written as:

\begin{matrix} \begin{matrix} \min : \frac{1}{2} {∥ w ∥}^{2} + C \sum_{i = 1}^{n} s_{i} ε_{i} \\ s . t . y_{i} (w * ϕ (x_{i}) + b) \geq 1 - ε_{i} \\ ε_{i} \geq 0 \end{matrix} \end{matrix} (4)

where ${∥ w ∥}^{2}$ is the regularization term that controls the margin and hence the generalization ability of the learning model. The slack variable $ε_{i}$ represents the acceptable degree of training error for the corresponding instance $x_{i}$. $C > 0$ is the penalty parameter, which trades off the size of the separation margin against the number of misclassified points, that is, generalization ability against training accuracy. $ϕ (\cdot)$ is the mapping to a high-dimensional feature space. The fuzzy membership value $s_{i}$ adjusts the degree of punishment of the corresponding sample. To solve this optimization problem, Equation 4 is first transformed into an unconstrained problem using the Lagrangian function:

\begin{matrix} L (w, b, α, β) = \frac{1}{2} {∥ w ∥}^{2} + C \sum_{i = 1}^{n} s_{i} ε_{i} - \sum_{i = 1}^{n} α_{i} (y_{i} (w * ϕ (x_{i}) + b) - 1 + ε_{i}) - \sum_{i = 1}^{n} β_{i} ε_{i} \end{matrix} (5)

The above formula satisfies the following conditions:

\begin{matrix} \begin{matrix} \frac{\partial L (w, b, α, β)}{\partial w} = w - \sum_{i = 1}^{n} α_{i} y_{i} ϕ (x_{i}) = 0 \\ \frac{\partial L (w, b, α, β)}{\partial b} = - \sum_{i = 1}^{n} α_{i} y_{i} = 0 \\ \frac{\partial L (w, b, α, β)}{\partial ε_{i}} = s_{i} C - α_{i} - β_{i} = 0 \end{matrix} \end{matrix} (6)

Substituting Equation 6 into Equation 5 transforms the optimization problem into the following dual form:

\begin{matrix} \begin{matrix} \min : - \sum_{i = 1}^{n} α_{i} + \frac{1}{2} \sum_{i = 1}^{n} \sum_{j = 1}^{n} y_{i} y_{j} α_{i} α_{j} ϕ (x_{i}) ϕ (x_{j}) \\ s . t . \sum_{i = 1}^{n} y_{i} α_{i} = 0, \forall i : 0 \leq α_{i} \leq s_{i} C \end{matrix} \end{matrix} (7)

where $α_{i}$ is the Lagrangian multiplier corresponding to $x_{i}$ , and it must also meet the KKT condition:

\begin{matrix} \begin{matrix} \forall i : α_{i} (y_{i} (w * ϕ (x_{i}) + b) - 1 + ε_{i}) = 0 \\ \forall i : (s_{i} C - α_{i}) ε_{i} = 0 \end{matrix} \end{matrix} (8)

In this way, the value of $α_{i}$ can be calculated, and $w$ can be calculated according to the following formula:

\begin{matrix} w = \sum_{i = 1}^{n} α_{i} y_{i} ϕ (x_{i}) \end{matrix} (9)

After that, the value of $b$ can be calculated from Equation 8. A sample with $α_{i} > 0$ is called a support vector. When $0 < α_{i} < s_{i} C$, the support vector lies on the margin boundary. When $α_{i} = s_{i} C$, the sample lies between the margin boundary and the separating hyperplane, or on the wrong side of the separating hyperplane. The biggest difference between traditional SVM and FSVM is that two samples with the same value of $α_{i}$ may belong to different types of support vectors because of their different fuzzy membership values $s_{i}$. In imbalanced problems, a smaller $s_{i}$ is usually assigned to the majority class to make the decision boundary more reasonable. Finally, the decision function of the optimal separating hyperplane can be expressed as:

\begin{matrix} f (x) = \mathrm{sign} (w * ϕ (x) + b) = \mathrm{sign} (\sum_{i = 1}^{n} α_{i} y_{i} ϕ (x_{i}) ϕ (x) + b) \end{matrix} (10)
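The effect of the membership values in Equation 4 can be seen in a small sketch. It is deliberately simplified and is not the solver used in the paper: it assumes a linear kernel (so that $ϕ (x) = x$) and minimizes the equivalent hinge-loss form of Equation 4 by plain subgradient descent instead of solving the dual in Equation 7. The step size, the epoch count, the toy data and all function names are hypothetical.

```python
def decision_value(w, b, x):
    """Linear decision value w . x + b (linear kernel assumed)."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b

def train_linear_fsvm(X, y, s, C=1.0, lr=0.001, epochs=10000):
    """Subgradient descent on 0.5*||w||^2 + C * sum_i s_i * max(0, 1 - y_i (w . x_i + b))."""
    p = len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0          # gradient of the 0.5*||w||^2 term
        for x_i, y_i, s_i in zip(X, y, s):
            if y_i * decision_value(w, b, x_i) < 1:   # margin violated, slack is active
                for j in range(p):
                    grad_w[j] -= C * s_i * y_i * x_i[j]
                grad_b -= C * s_i * y_i
        w = [w_j - lr * g_j for w_j, g_j in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Sign of the decision value, as in Eq. 10."""
    return 1 if decision_value(w, b, x) >= 0 else -1

# Toy data: 2 minority samples (+1) and 6 majority samples (-1).
X = [[2.0, 2.0], [2.5, 1.5],
     [0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.5, 0.5], [-0.5, 0.0], [0.0, -0.5]]
y = [1, 1, -1, -1, -1, -1, -1, -1]
ir = 6 / 2                                     # imbalanced ratio of the toy data
s = [ir if y_i == 1 else 1.0 for y_i in y]     # minority MV = IR, majority MV = 1

w, b = train_linear_fsvm(X, y, s)
print([predict(w, b, x) for x in X])
```

Each violated margin is penalized in proportion to $s_{i} C$, so an error on a minority sample costs three times as much as an error on a majority sample in this toy setting.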

Experiments and Results

Evaluation Metrics and Datasets

In this paper, G-mean, F-measure and AUC values are used to comprehensively evaluate the classification quality of the model. In imbalanced classification, the overall accuracy is not effective in evaluating the classification results. To evaluate the imbalanced classification effect by accuracy may cause the model to be biased towards the majority class, because a high overall accuracy can be obtained by ensuring only the correct classification of the majority class. The overall accuracy ignores the important influence of the minority class.

F-measure is defined based on the metrics of Precision (Pre) and Sensitivity (Sen), which are defined as:

\begin{matrix} Pre = \frac{TP}{TP + FP} \end{matrix} (11)

\begin{matrix} Sen = \frac{TP}{TP + FN} \end{matrix} (12)

where TP (True Positives) denotes the number of positive observations (minority class) correctly classified as positive, FP (False Positives) denotes the number of negative observations (majority class) incorrectly classified as positive, FN (False Negatives) denotes the number of positive observations incorrectly classified as negative, and TN (True Negatives) denotes the number of negative observations correctly classified as negative (Ye et al., 2020). The F-measure is defined as:

\begin{matrix} F\text{-}measure = 2 * Sen * Pre / (Sen + Pre) \end{matrix} (13)

G-mean is defined based on the metrics of Sensitivity (Sen) and Specificity (Spe), which are defined as:

\begin{matrix} Spe = \frac{TN}{TN + FP} \end{matrix} (14)

\begin{matrix} G\text{-}mean = \sqrt{Sen * Spe} \end{matrix} (15)
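The metrics in Equations 11 to 15 follow directly from the four confusion-matrix counts. The sketch below uses made-up counts for a 1,000-sample problem with 10 minority samples to show why accuracy alone is misleading; the function name is hypothetical and no guard against empty denominators is included.

```python
import math

def imbalance_metrics(tp, fp, fn, tn):
    """Precision, sensitivity, specificity, F-measure and G-mean (Eqs. 11-15)."""
    pre = tp / (tp + fp)
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "pre": pre,
        "sen": sen,
        "spe": spe,
        "f_measure": 2 * sen * pre / (sen + pre),
        "g_mean": math.sqrt(sen * spe),
    }

# Illustrative counts: half of the 10 minority samples are missed.
m = imbalance_metrics(tp=5, fp=10, fn=5, tn=980)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Here the accuracy is 0.985 even though only half of the minority samples are recognized, whereas the F-measure (0.4) and G-mean (about 0.70) expose the weak minority performance.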

AUC (Area Under Curve) is defined as the area enclosed by the ROC curve and the coordinate axis; its value is never greater than 1. ROC stands for receiver operating characteristic. The curve is drawn by sweeping a series of decision thresholds (cutoff values) for the binary classifier, with the true positive rate (Sen) as the ordinate and the false positive rate (1-Spe) as the abscissa. The closer the AUC is to 1.0, the better the classifier separates the two classes; when it is equal to 0.5, the classifier is no better than random guessing and has no application value. The algorithm was tested on twelve binary classification datasets from the Keel database, as shown in Table 1.
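For a finite set of scored samples, the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below uses this pairwise form instead of integrating the curve explicitly; the function name, scores and labels are made up for illustration.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair. Labels are +1 (minority) and -1 (majority)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    correct = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return correct / (len(pos) * len(neg))

scores = [0.9, 0.4, 0.35, 0.8, 0.1]   # classifier decision values
labels = [1, 1, -1, -1, -1]
print(auc(scores, labels))            # 5 of the 6 pairs are ordered correctly
```

Because it depends only on the ordering of the scores, AUC is unaffected by the choice of a single decision threshold.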

TABLE 1. Description of the datasets.

Experiment Settings

In data feature processing, a deep neural network with four fully connected layers is used. The fuzzy support vector machine uses a Gaussian kernel. For the FSVM classifier, the penalty constant C and the width of the Gaussian kernel σ are selected by grid search from the sets $\{10^{- 3}, 10^{- 2}, 10^{- 1}, 1, 10^{1}, 10^{2}, 10^{3}, 10^{4}\}$ and $\{2^{- 5}, 2^{- 4}, 2^{- 3}, 2^{- 2}, 2^{- 1}, 1, 2^{1}, 2^{2}, 2^{3}, 2^{4}\}$, respectively. The fuzzy membership value of the minority samples is set to the imbalanced ratio (IR), which is the ratio of the number of majority class samples to the number of minority class samples in the data:

\begin{matrix} IR = \frac{num_{maj}}{num_{min}} \end{matrix} (16)

where $num_{min}$ is the number of minority class samples, and $num_{maj}$ is the number of majority class samples. The fuzzy membership value of the majority class is set to 1. To reduce the effect of randomness, five-fold cross-validation is applied, and the algorithms are executed for 5 independent runs.
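The membership assignment of Equation 16 and the search grid for C and σ can be written down as a short sketch. The function and variable names are hypothetical, and the sketch simply uses whatever labels it is given, since the text does not specify whether IR is computed before or after oversampling.

```python
from itertools import product

def imbalance_ratio(labels, minority_label=1):
    """Eq. 16: number of majority samples divided by number of minority samples."""
    n_min = sum(1 for y in labels if y == minority_label)
    n_maj = len(labels) - n_min
    return n_maj / n_min

def membership_values(labels, minority_label=1):
    """Minority samples get MV = IR, majority samples get MV = 1."""
    ir = imbalance_ratio(labels, minority_label)
    return [ir if y == minority_label else 1.0 for y in labels]

C_GRID = [10.0 ** e for e in range(-3, 5)]       # 10^-3, ..., 10^4
SIGMA_GRID = [2.0 ** e for e in range(-5, 5)]    # 2^-5, ..., 2^4
param_grid = list(product(C_GRID, SIGMA_GRID))   # every (C, sigma) pair to evaluate

labels = [1] * 2 + [-1] * 8
print(imbalance_ratio(labels))     # 4.0
print(membership_values(labels))
print(len(param_grid))             # 80
```

In a full experiment, each of the 80 pairs would be scored with five-fold cross-validation and the best pair retained.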

In order to compare the classification performance of the proposed model, nine methods are used. B-SMOTE (Han et al., 2005) uses SMOTE (Chawla et al., 2002) to synthesize new samples for the minority class samples lying around the boundary line. SOMO (Douzas and Bacao, 2017) produces a two-dimensional representation of the data and then generates within-cluster and between-cluster synthetic samples. KmeansSMOTE (Douzas et al., 2018) uses the k-means clustering algorithm and SMOTE to balance datasets, and oversamples only in safe areas to avoid noise. FSVM-CEN and FSVM-HYP (Batuwita and Palade, 2010) use a linear decay function to calculate the MVs based on the distance from the sample's own class center or from the actual hyperplane, and finally use FSVM for classification. FSVM-WD and FSVM-BD (Yu et al., 2019) adopt a k-nearest neighbors-based probability density estimation to design a membership function based on the within-class and between-class relative density, and then assign weights to different samples. ACFSVM (Tao et al., 2020) first calculates an affinity for each sample in the majority class. Then the class probability of each majority class sample is determined using the kernel KNN technique and combined with its affinity to form the MVs. The parameters of each algorithm take the default values given in the corresponding reference. For DFSVM, the margin m and the number of neurons in the third hidden layer of the deep neural network are selected by grid search from the sets {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} and {2, 4, 6, 8, 10, 12}, respectively.

Comparison of Imbalanced Classification Performance

To verify the effectiveness of the model on imbalance classification tasks, DFSVM is compared with nine representative and state-of-the-art class imbalance algorithms. Table 2 summarizes the experimental results on part of the datasets in terms of G-mean, F-measure and AUC, and highlights the best performing models in boldface. It can be observed that DFSVM achieves better performance in most cases. For the Cleveland0vs4 dataset, DFSVM improves G-mean by 0.123 relative to ACFSVM. Although DFSVM does not achieve the best AUC on the Cargood dataset, it achieves the best G-mean and F-measure, which indicates that DFSVM recognizes the minority class well. On the Pageblocks0 dataset, DFSVM does not achieve the best result on any of the three evaluation metrics. However, its classification performance does not differ much from the best result. For example, its G-mean differs from the best by only 0.2%, and its F-measure by 1%, which indicates that DFSVM still classifies well. The Yeast6 dataset has the highest imbalance ratio with IR = 41.4 and the Glass1 dataset has the lowest imbalance ratio with IR = 1.82. The proposed method achieves good classification results on both the Yeast6 and Glass1 datasets: the best results were obtained for all evaluation metrics on the Yeast6 dataset, and for two evaluation metrics on the Glass1 dataset. This shows that DFSVM classifies well on datasets with different imbalance ratios and is robust.

TABLE 2. Results of different imbalanced classification methods on datasets.

In addition, statistical tests were performed in this paper to verify the validity and significance of the proposed method. Under the null hypothesis, all algorithms are equivalent (i.e., any difference between their mean ranks is only random). The Friedman statistic

\begin{matrix} χ_{F}^{2} = \frac{12 N}{k (k + 1)} [\sum_{j = 1}^{k} R_{j}^{2} - \frac{k {(k + 1)}^{2}}{4}] \end{matrix} (17)

is distributed according to the $χ^{2}$ distribution with $k - 1$ degrees of freedom when N and k are large enough. Here N is the number of datasets, k is the number of algorithms compared, and $R_{j}$ is the average rank of the j-th algorithm over the datasets. Friedman’s $χ_{F}^{2}$ statistic is undesirably conservative, so we use a better statistic

\begin{matrix} F_{F} = \frac{(N - 1) χ_{F}^{2}}{N (k - 1) - χ_{F}^{2}} \end{matrix} (18)

which is distributed according to the F-distribution with $k - 1$ and $(k - 1) (N - 1)$ degrees of freedom. We calculated the average ranking of the different algorithms under the G-mean metric based on the experimental results in Table 2 and obtained $F_{F} = 8.055$. At significance level $α = 0.05$, the critical value is 1.976. Since $F_{F}$ under G-mean is greater than 1.976, the null hypothesis is rejected, i.e., the compared algorithms are not equivalent at $α = 0.05$. Because the null hypothesis is rejected, the Nemenyi post-hoc test is used to complete the performance analysis, with DFSVM regarded as the control method. Two methods are considered significantly different when the difference between their mean rankings is greater than the critical difference ( $C D = q_{α} \sqrt{k (k + 1) / 6 N} = 3.91$ , where $q_{α} = 3.163$ in this paper). The average rankings of the comparison methods and DFSVM, and the differences between them, are given in Table 3. As can be seen from Table 3, DFSVM ranks ahead of the other algorithms. Although it is not significantly different from KmeansSMOTE, FSVM-BD and ACFSVM, DFSVM still achieves a better classification effect than these methods across the datasets.
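The statistics in Eqs 17, 18 and the critical difference can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names are hypothetical, and the value N = 12 in the usage example is an assumption chosen only because, together with k = 10 (DFSVM plus the nine comparison methods) and the stated $q_{α} = 3.163$, it reproduces the reported CD of 3.91.

```python
import math


def friedman_chi2(avg_ranks, n_datasets):
    """Friedman statistic (Eq. 17) from the average rank R_j of each algorithm."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12.0 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


def iman_davenport(chi2, n_datasets, k):
    """Less conservative F_F statistic (Eq. 18).

    Compared against the F-distribution with (k - 1) and (k - 1)(N - 1) d.o.f.
    """
    return (n_datasets - 1) * chi2 / (n_datasets * (k - 1) - chi2)


def nemenyi_cd(q_alpha, k, n_datasets):
    """Critical difference for the Nemenyi post-hoc test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


if __name__ == "__main__":
    # Toy example with hypothetical average ranks of 3 algorithms on 4 datasets.
    chi2 = friedman_chi2([1.5, 2.0, 2.5], n_datasets=4)
    print("chi2_F =", chi2)
    print("F_F    =", iman_davenport(chi2, n_datasets=4, k=3))

    # k = 10 follows from the text; N = 12 is an assumption (see lead-in).
    print("CD     =", round(nemenyi_cd(3.163, k=10, n_datasets=12), 2))
```

Two algorithms whose average ranks differ by more than the value returned by `nemenyi_cd` would be treated as significantly different under this test.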

TABLE 3. Average ranking of different methods on G-mean and its ranking difference with DFSVM.

Influence of Activation Function on the Performance of the Proposed Method

In this subsection, we compare the effect of the Gumbel activation function and the ReLU activation function on the experimental results. We selected five datasets, three of which are Ecoli datasets. The Ecoli datasets differ in how the original classes are grouped: for example, in the Ecoli0147vs2356 dataset, positive samples belong to classes 0, 1, 4, 7 and negative samples belong to classes 2, 3, 5, 6, whereas in the Ecoli01vs235 dataset, positive samples belong to classes 0, 1 and negative samples belong to classes 2, 3, 5. In the experiments, the structure of the deep neural network is fixed, the number of neurons in the third hidden layer is set to 8, and the margin of the triplet loss is set to 0.2. The experimental results are shown in Figure 5. Overall, the Gumbel function gives a better classification effect than the ReLU function. The most significant improvement occurs on the Ecoli0267vs35 dataset, where the Gumbel function is 6.19% and 7.71% higher than the ReLU function in G-mean and F-measure, respectively. On the Glass1 dataset, the Gumbel function is not as good as the ReLU function in G-mean and F-measure, but it achieves a better AUC.

FIGURE 5. Classification effect of DFSVM under two different activation functions.
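For readers comparing the two activations, the sketch below evaluates both on a few inputs. It assumes that the Gumbel activation takes the form of the standard Gumbel cumulative distribution function, exp(-exp(-x)); this form is an assumption made here for illustration, and the definition given earlier in the paper should be treated as authoritative. The function names are hypothetical.

```python
import math


def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def gumbel(x):
    """Assumed Gumbel activation: the standard Gumbel CDF exp(-exp(-x)).

    Very negative inputs are short-circuited to 0.0 to avoid overflow in exp.
    """
    if x < -700.0:
        return 0.0
    return math.exp(-math.exp(-x))


if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  gumbel={gumbel(x):.4f}")
```

Under this assumption the Gumbel activation is bounded in (0, 1) and asymmetric around zero, whereas ReLU is unbounded above and exactly zero for all negative inputs.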

Influence of Classifier and Sampling Algorithms on Classification Performance

This subsection compares the classification quality of two different classifiers, SVM and FSVM. Five datasets were again selected. In the experiments, the structure of the deep neural network is fixed, the number of neurons in the third hidden layer is again set to 8, and the margin of the triplet loss is set to 0.2. The experimental results are shown in Table 4, with the best-performing models highlighted in boldface. Overall, the classification quality of FSVM is better than that of SVM. On the Ecoli01vs235 dataset, the AUC of FSVM is slightly lower than that of SVM, but FSVM is 0.0524 and 0.2428 higher than SVM in G-mean and F-measure, respectively. FSVM assigns a higher misclassification cost to the minority class in the objective function and therefore classifies imbalanced data better.

TABLE 4. Classification effect of DFSVM under two different base classifiers.
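The role of the higher misclassification cost can be illustrated with a cost-weighted hinge loss, the slack term of a soft-margin SVM objective in which each sample's penalty is scaled by a membership value. This is a simplified sketch under stated assumptions, not the paper's FSVM: the membership values (1.0 for minority samples, 0.1 for majority samples) and the function name are hypothetical, and the regularization term on the weights is omitted.

```python
def weighted_hinge(scores, labels, memberships, C=1.0):
    """Sum of C * s_i * max(0, 1 - y_i * f(x_i)) over all samples.

    scores      -- decision values f(x_i)
    labels      -- +1 (minority) or -1 (majority)
    memberships -- per-sample weights s_i; all ones gives the plain SVM term
    """
    return C * sum(
        s * max(0.0, 1.0 - y * f)
        for f, y, s in zip(scores, labels, memberships)
    )


if __name__ == "__main__":
    # One minority and one majority sample, both misclassified by the same amount.
    scores = [-0.5, 0.5]
    labels = [+1, -1]

    plain = weighted_hinge(scores, labels, [1.0, 1.0])
    fuzzy = weighted_hinge(scores, labels, [1.0, 0.1])  # hypothetical memberships
    print("plain SVM slack cost:", plain)
    print("cost-weighted slack :", fuzzy)
```

With these hypothetical memberships, the misclassified minority sample contributes ten times as much to the objective as the equally misclassified majority sample, so the optimizer is pushed toward boundaries that favor the minority class.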

In addition, we compare the center-distance-based random feature oversampling method used in DFSVM with other sampling methods, namely SMOTE, B-SMOTE, SOMO and KmeansSMOTE. In the experiments, only the sampling method is replaced and the remaining settings are kept the same. The experiments were conducted on four different datasets, and the results are shown in Table 5, with the best result for each dataset in boldface. The random feature oversampling method based on the center distance performs better than the other sampling methods overall. On the Ecoli01vs235 dataset, DFSVM does not achieve the best G-mean and AUC, but it achieves the best F-measure.

TABLE 5. The experimental results of different sampling methods.

Influence of Network Structure and Margin on the Performance of the Proposed Method

This subsection conducts comparative experiments on different margins and network structures of DFSVM. First, we selected three Ecoli datasets and one Yeast dataset for experiments on different margins. In the experiments, the margin of the triplet loss is taken from {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} and the number of neurons in the third hidden layer of the deep neural network is set to 8. The experimental results on the evaluation metrics are shown in Figure 6. The x-axis represents the margin value, and the y-axis represents the resulting value of each metric (G-mean, F-measure and AUC). Within the same dataset, the different metrics follow the same trend as the margin changes. However, the trend is not the same across datasets. On the Ecoli0267vs35 dataset, the best classification results are obtained when the margin is 0.3, whereas on the Ecoli01vs235 dataset, the classification is worse when the margin is 0.3. Across the datasets, smaller margins generally give better imbalanced classification results. Smaller margins make it easier for the triplet loss to converge to zero, but margins that are too small can make it difficult for the model to distinguish similar features.

We also conducted comparison experiments on DFSVM with different hidden layer sizes. The margin is fixed at 0.2 in these experiments, and the number of neurons in the third hidden layer of the neural network is taken from {2, 4, 6, 8, 10, 12}. The experimental results under the different evaluation metrics are shown in Tables 6–8. We selected three Ecoli datasets and three Yeast datasets, and the best results are in boldface. For the Ecoli datasets, better results are achieved when the number of nodes is 6 or 8. For the Yeast datasets, better classification quality is achieved with 8 or 10 nodes.

In addition, it can be seen from Table 2 that the features extracted by the deep neural network are more beneficial for classification in comparison with other FSVM-based methods.
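The observation that smaller margins let the triplet loss reach zero more easily can be seen from the loss itself. The sketch below assumes the common form max(0, d(a, p) - d(a, n) + margin) with Euclidean distance in the embedding space; the exact distance used in the paper is defined in the method section, and the function names and example points here are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


if __name__ == "__main__":
    # Hypothetical 2-D embeddings: the positive is closer to the anchor
    # than the negative, but only by a distance gap of 0.3.
    anchor, positive, negative = (0.0, 0.0), (0.3, 0.0), (0.6, 0.0)

    for margin in (0.1, 0.2, 0.3, 0.5, 0.9):
        loss = triplet_loss(anchor, positive, negative, margin)
        print(f"margin={margin:.1f}  loss={loss:.2f}")
```

For this triplet the loss is already zero whenever the margin is at most 0.3, so a small margin stops pushing the classes apart early, while larger margins keep producing a gradient until the gap between the two distances exceeds the margin.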

FIGURE 6. Classification results under different margins.

TABLE 6. Classification results of DFSVM with different number of neurons for G-mean.

TABLE 7. Classification results of DFSVM with different number of neurons for F-measure.

TABLE 8. Classification results of DFSVM with different number of neurons for AUC.

Conclusion

This paper proposes DFSVM, an imbalanced classification method that combines deep neural networks with a fuzzy support vector machine. To obtain features with intra-class similarity and inter-class discrimination, a deep neural network is trained using the triplet loss function and the Gumbel activation function to obtain a deep feature representation. The experimental results show that the proposed feature extraction method captures information well and can effectively distinguish different classes. To balance the data distribution, a random feature sampling algorithm based on the class center is applied to the minority samples to maintain the diversity of the minority class. Compared with other sampling algorithms, it can effectively avoid overfitting and improve the generalization performance of the model. The fuzzy support vector machine assigns a higher misclassification loss to the minority class, which enhances the classification performance of the algorithm on the minority class. The experimental results show that the proposed DFSVM achieves good classification results on the G-mean, F-measure and AUC evaluation metrics. In future work, more efficient network structures and more robust feature extractors can be used to provide valid measures for imbalanced classification.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://sci2s.ugr.es/keel/imbalanced.php.

Author Contributions

K-FW carried out all the data collection and drafted the manuscript. JA designed the study and revised the manuscript. ZW provided statistical analysis ideas for this work, provided methods for over visualization of the results and revised the manuscript. CC and MC performed the data analysis. X-HM and H-QB critically reviewed the article. All authors approved the final manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (61703279, 51775385); in part by the Shanghai Industrial Collaborative Science and Technology Innovation Project (2021-cyxt2-kj10); in part by the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0100) and the Fundamental Research Funds for the Central Universities; and in part by Innovation Program of Shanghai Municipal Education Commission (202101070007E00098). The authors are grateful for the efforts from their colleagues in Sino-German Center of Intelligent Systems, Tongji University.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
