Increasing Minority Recall Support Vector Machine Model for Imbalanced Data Classification

Imbalanced data classification is gaining importance in data mining and machine learning. The minority class recall rate requires special treatment in fields such as medical diagnosis, information security, industry, and computer vision. This paper proposes a new strategy and algorithm based on a cost-sensitive support vector machine to increase the minority class recall rate to 1, because the misclassification of even a few samples can cause serious losses in some physical problems. In the proposed method, the modification employs a margin compensation to make the margin lopsided, enabling decision boundary drift. When the boundary reaches a certain position, the minority class samples will be generalized further to satisfy the requirement of a recall rate of 1. In the experiments, the effects of different parameters on the performance of the algorithm were analyzed, and the optimal parameters for a recall rate of 1 were determined. The experimental results reveal that, for the imbalanced data classification problem, the traditional definite-cost classification scheme and the models selected using the area under the receiver operating characteristic curve criterion rarely produce a recall rate of 1. The new strategy can yield a minority recall of 1 for imbalanced data while the loss on the majority class remains acceptable; moreover, it improves the g-means index. The proposed algorithm provides superior performance in minority recall compared to conventional methods and has important practical significance in credit card fraud detection, medical diagnosis, and other areas.


Introduction
Imbalanced data classification is gaining importance in data mining and machine learning [1][2][3]. An imbalance in a particular dataset occurs when the number of instances in one class (the minority class) is significantly smaller than that in the other class (the majority class). The minority class is generally of interest in data classification; hence, it is considered the positive (+) class, whereas the majority class is considered the negative (−) class [4,5]. For instance, in the medical field, as there is more interest in the disease samples than in the healthy samples, the disease samples comprise the minority class. Traditional machine learning models generally treat the accuracy of the two categories as equally important, i.e., they do not consider the accuracy of some categories to be more important than that of others, particularly when the sample size of the minority category is small [6,7]. Although these models have been successfully applied to various balanced data classification problems, their performance decreases substantially when applied to imbalanced datasets [8][9][10] owing to the lack of sufficient training data in the positive class, which are required to perform an accurate classification of its instances. To address this issue, the positive class of a dataset has attracted increased attention [11], and traditional classifiers have undergone numerous improvements in order to handle imbalanced data applications.
In certain applications, the tolerance to errors in the positive class is extremely low. For instance, in the case of industrial system fault diagnosis [12], the classifier must deal with an imbalanced dataset, i.e., the number of available healthy class instances outnumbers the faulty class ones. Accordingly, it is necessary to develop a classifier that accounts for this type of imbalanced data distribution and warns of all possible failures, even if there may be many false-warning occurrences. Credit card fraud detection is a well-known classification problem [13]. In order to target a specific customer segment, banks use data mining algorithms to classify customers as buyers and nonbuyers. In this context, if a model correctly detects a potential customer for a campaign, there will be a particular profit related to gaining that customer; if a potential buyer is not identified, the profits that would be gained from him/her might be lost. However, if a potential nonbuyer is identified as a buyer, credit card fraud would occur, causing the bank potential massive losses. Similarly, a 100% recall model reduces some of the profits but does not risk credit card fraud occurrences. Furthermore, failing to diagnose a cancerous lesion is unacceptable and can have devastating effects on a patient, although this situation rarely occurs [14]. In general, the classifier is only used as an aid to manual diagnosis, which means that the classifier can only diagnose patients at risk of cancer and provide an artificial judgment.
There are disastrous consequences if the classifier misses any patient who may have cancer. In this study, we developed a cost-sensitive support vector machine (SVM) model to increase the recall of positive samples in a real-world setting to 100%. Based on this model, we proposed an algorithmic strategy to increase the positive-class recall rate to 1. We introduced different penalty factors, namely, C+ and C−, for the positive and negative SVM slack variables during the training process and adjusted the classification boundary by altering the positive-class margin.
This approach ensured that the positive-class samples would be more generalized, achieving the target recall rate of 1 when the boundary reached a certain position. For parameter selection, we employed the grid search approach and selected a model with a recall of 100% and a higher specificity for the negative class. The SVM model was adopted to address the data imbalance and exhibited an acceptable performance [15][16][17][18][19][20][21].
Our study makes a significant contribution to the field because we were able to confirm that (1) increasing the recall rate to 100% is a feasible classification indicator; (2) the decision boundary can be altered successfully by correcting the positive margin; and (3) the g-means increases when the recall rate is increased to 1 in some datasets. The experimental results demonstrated that these advantages improve the positive-class classification performance to a greater extent than that achieved in previous studies. The remainder of this paper is organized as follows. In Section 2, the cost-sensitive SVM model is briefly reviewed. In Section 3, the proposed cost-sensitive model is introduced to improve the recall rate of the positive class. Section 4 describes the tests performed using the new method on real data. Finally, Section 5 discusses and summarizes the advantages and disadvantages of the proposed method and suggests future research directions.

Related Work
In this section, we briefly discuss different imbalanced dataset classification problems. The existing classification methods for imbalanced data can be roughly divided into two categories [22,23]: data-level and algorithm-level methods [24][25][26][27]. We first discuss the most effective approaches and then discuss the advantages and limitations of these approaches.
Data-level approaches, also known as sampling methods [28], typically involve data preprocessing. These approaches rebalance highly skewed class distributions using various resampling methods, such as oversampling of the positive instances and undersampling of the negative instances, and at times the two are combined [29]. The simplest way to balance a dataset is to undersample (randomly or selectively) the majority class while keeping the original data of the minority class. However, this method results in loss of information from the majority class [30]. Another approach is oversampling, in which minority class instances are randomly duplicated to rebalance the class distribution. Although oversampling does not result in loss of information from the majority class, it can cause overfitting. To address this issue, Chawla et al. [31] proposed the Synthetic Minority Oversampling Technique (SMOTE), which generates new instances by linear interpolation between closely lying minority class samples. SMOTE generates new minority samples by interpolating between k-nearest minority class neighbors and has a better classification effect than random oversampling. However, the samples generated through this method may cause an overlap between the two categories.
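The interpolation step at the heart of SMOTE can be sketched in a few lines. The following is a minimal illustration rather than the reference implementation; the function name `smote_sketch` and the brute-force neighbor search are our own simplifications:

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation
    between a random minority sample and one of its k nearest minority
    neighbors (the core idea of SMOTE)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # brute-force pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    neighbors = np.argsort(d, axis=1)[:, :k]    # k nearest minority neighbors
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                     # pick a minority sample
        j = neighbors[i, rng.integers(k)]       # pick one of its neighbors
        lam = rng.random()                      # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Each synthetic point lies on the segment between two existing minority samples, which is why SMOTE can blur the class boundary when the classes overlap.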
In contrast, using algorithm-level approaches [32,33], researchers have been able to introduce cost-sensitive learning to reduce the degree of imbalance by assigning a higher learning cost to positive-class samples [34][35][36]. Algorithm-level approaches directly modify the learning procedure to improve the sensitivity of the classifier toward minority classes. One such crucial approach to class-imbalanced learning was proposed by Veropoulos et al. [37], who used different penalty constants for different classes to assign higher costs to errors in classifying positive-class instances than those in classifying negative-class instances. However, this method does not consider the distance between the two types of samples and the classification hyperplane.
Some studies successfully applied the aforementioned methods in several different fields such as cancer diagnosis [38], sentiment analysis, and text classification [39]. In this study, we improved Veropoulos's method, modified the distance between the positive samples and the classification hyperplane, and developed an SVM classification algorithm with special classification purposes.

Cost-Sensitive Support Vector Machine
The goal of classification is to map feature vectors x ∈ X to class labels y ∈ Y = {−1, 1} [40]. As in previous studies [41,42], a training set (x_1, y_1), (x_2, y_2), …, (x_N, y_N) is given, where x_i is an instance with an n-tuple of attribute values belonging to a certain instance space X and y_i ∈ Y = {−1, 1} is a label.
A standard classification problem of a linear SVM can be expressed as

arg min_{w,b,ξ} (1/2)‖w‖² + C Σ_i ξ_i
subject to y_i(w^T x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, …, N,

where C > 0 is the penalty parameter. The standard SVM model for the design of classification algorithms minimizes the probability of error, assuming that all misclassifications have the same cost. To control the classification recall, the penalty-regularized model proposed by Veropoulos et al. [37] was closely inspected. The key idea of this model is to introduce uneven loss functions to reweight the penalties of the samples in the imbalanced classes [7] and reduce the bias of the classification boundary toward the negative class. The class labels are predetermined, with I+ and I− denoting the index sets for the positive and negative classes, respectively. When I+ and I− are categorized, different costs are assigned to the two classes, and the standard SVM can be expanded to

arg min_{w,b,ξ} (1/2)‖w‖² + C+ Σ_{i∈I+} ξ_i + C− Σ_{i∈I−} ξ_i
subject to y_i(w^T x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, …, N,

where C+ is the cost of a false negative, C− is the cost of a false positive, and the ξ_i are slack variables. In the regularization model proposed by Veropoulos et al. [37], the weight vector w is a d-dimensional transposed vector normal to the decision boundary; the bias b is a scalar offsetting the decision boundary; and the slack variables ξ_i, which measure the losses, urge the samples to satisfy the boundary constraints in the optimization. Thus, the cost value for the positive class is typically set higher than that for the negative class. The dual problem of the primal model is written using Lagrange multipliers as

arg max_α Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
subject to Σ_i α_i y_i = 0, 0 ≤ α_i ≤ C+ (i ∈ I+), 0 ≤ α_i ≤ C− (i ∈ I−).
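For intuition, the cost-sensitive primal with C+ > C− can be approximated by subgradient descent on its unconstrained hinge-loss form. This is a linear, fixed-learning-rate sketch for illustration only — the paper works with the dual and a kernel — and the function names are our own:

```python
import numpy as np

def cs_svm_train(X, y, C_pos=10.0, C_neg=1.0, lr=0.01, epochs=500):
    """Subgradient descent on the cost-sensitive primal:
    (1/2)||w||^2 + C+ * sum_{i in I+} hinge_i + C- * sum_{i in I-} hinge_i."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    w = np.zeros(X.shape[1])
    b = 0.0
    C = np.where(y > 0, C_pos, C_neg)        # per-sample misclassification cost
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                  # samples with nonzero hinge loss
        # subgradient of the regularizer plus the weighted hinge terms
        gw = w - ((C[active] * y[active])[:, None] * X[active]).sum(axis=0)
        gb = -(C[active] * y[active]).sum()
        w -= lr * gw
        b -= lr * gb
    return w, b

def cs_svm_predict(X, w, b):
    """Sign of the decision function, with the threshold fixed at 0."""
    return np.where(np.asarray(X, float) @ w + b >= 0, 1, -1)
```

Setting C_pos well above C_neg makes hinge violations by positive samples far more expensive, which pushes the learned boundary toward the negative class, as the text describes.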

Increasing Minority Recall Model
The key objective of this study was to identify a misclassification cost value with a special purpose using a specific method, assuming that the costs of all types of misclassifications are not equal and that the true costs of misclassification cannot be determined. The goals are to increase the recall rate of the positive class of all datasets to 100% for physical problems and to improve the accuracy of the negative class as much as possible. When the positive recall is increased to 1, the accuracy of the negative class may be affected, but it does not decrease significantly and remains within an acceptable range.

Strategies to Improve Recall Rates for Minority Samples.
Presently, in cost-sensitive learning, the cost-sensitive factor is often determined by a random interval or by using the sample number ratio between categories as the misclassification cost [43]. In contrast, we consider a class of imbalanced datasets whose data structure enables us to search for misclassification costs with a "special purpose," namely, increasing the positive recall rate to 1, because misclassification of positive samples can cause massive losses in physical problems. Modifying the loss function forces the classification algorithm to be biased toward the positive class, and the classification boundary leans toward the negative class.
The key idea is to adjust the margin of the positive class so that the classification boundary shifts. Because the theoretical threshold of 0 is used as the judgment threshold of the sign function, as long as the 0 point lies on the left side of the classification boundary, as many positive samples as possible will be included. Using the grid search method, the theoretical threshold and the classification boundary are adjusted until a recall rate of 1 is obtained.
When the data distribution is as shown in Figure 1, the use of this method is limited because there is too much overlap between the classes. The adjusted decision boundary classifies the positive-class samples of the imbalanced dataset correctly, but it simultaneously assigns the negative-class samples in the overlapping area to the positive class.

Proposed Support Vector Machine Model.
The SVM minimizes the hinge loss function L(yf) = (1 − yf)+ = max(0, 1 − yf). Veropoulos et al. [37] extended the loss function in the biased support vector machine (B-SVM) classification as follows:

L+(f) = C+ (1 − f)+,  (8)
L−(f) = C− (1 + f)+.  (9)

Equations (8) and (9) assign different cost values to instances in the positive and negative classes, respectively; the misclassification costs of samples in the positive class are generally set to outweigh those in the negative class. As our objective was to increase the recall rate of the positive-class samples to 1, we put a constraint on the positive-class margin and extended the loss function as follows:

L+(f) = C+ (1 − kf)+,  (10)
L−(f) = C− (1 + f)+.  (11)

In Figure 2, C+ controls the slope of the positive-class loss, and k controls its intersection with the abscissa axis, which lies at 1/k. C− controls the negative-class slope, and the corresponding intersection lies at 1. For the loss to be 0, the classification confidence must be sufficiently high: for the positive class, the loss is 0 when the degree of confidence exceeds 1/k.
By replacing the original loss with the loss functions shown in (10) and (11), the original SVM can be extended to

arg min_{w,b,ξ} (1/2)‖w‖² + C+ Σ_{i∈I+} ξ_i + C− Σ_{i∈I−} ξ_i
subject to y_i(w^T x_i + b) ≥ 1/k − ξ_i, i ∈ I+,
y_i(w^T x_i + b) ≥ 1 − ξ_i, i ∈ I−,
ξ_i ≥ 0, i = 1, …, N.  (12)

Here, C+ and C− are the two types of costs, and the positive margin can be changed by adjusting the value of k. In addition to the adjustable penalty, the motivation of this study is to give the loss functions of the imbalanced classes different hinge points. A biased decision boundary caused by imbalanced classes can be recovered with the help of a scalable margin. Herein, we develop a cost-sensitive model for the SVM class-imbalanced problem that has both an adjustable penalty and a scalable margin.
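The modified loss pair can be written down directly. The exact functional forms below are a reconstruction from the description of the hinge points (zero loss once the positive-class confidence exceeds 1/k, and once the negative-class confidence exceeds 1), so treat them as an assumption:

```python
import numpy as np

def pos_loss(f, C_pos, k):
    """Positive-class loss: a hinge whose zero point is moved to f = 1/k."""
    return C_pos * np.maximum(0.0, 1.0 - k * f)

def neg_loss(f, C_neg):
    """Negative-class loss: the ordinary hinge, zero once f <= -1."""
    return C_neg * np.maximum(0.0, 1.0 + f)
```

Adjusting k moves the positive hinge point 1/k while the decision threshold stays at 0, which is what allows the learned boundary to drift.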
To solve the newly created problem, the Lagrangian function is introduced:

L(w, b, ξ, α, μ) = (1/2)‖w‖² + C+ Σ_{i∈I+} ξ_i + C− Σ_{i∈I−} ξ_i − Σ_i α_i [y_i(w^T x_i + b) − r_i + ξ_i] − Σ_i μ_i ξ_i,

where r_i = 1/k for i ∈ I+, r_i = 1 for i ∈ I−, and α_i ≥ 0 and μ_i ≥ 0 are the Lagrange multipliers. Therefore, the dual optimization model of (12) is defined as

arg max_α Σ_i r_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
subject to Σ_i α_i y_i = 0, 0 ≤ α_i ≤ C+ (i ∈ I+), 0 ≤ α_i ≤ C− (i ∈ I−).

Our objective is to solve this dual problem (Algorithm 1).

Performance Evaluation.
In a classification problem, evaluation measures play a key role in assessing the performance of the classification model. The overall prediction accuracy is used to evaluate the classification of a balanced dataset; however, it is not an effective metric for an imbalanced dataset because it does not consider the prediction accuracy of each class separately. This shortcoming matters mainly because the negative sample size is sometimes much larger than the positive sample size. In such cases, with an imbalance of 99 to 1, a classifier that classifies everything as negative will be 99% accurate, but it will be completely useless as a classifier. Therefore, more attention should be paid to the positive class. The current classification indicators are based on the confusion matrix presented in Table 1.
In the confusion matrix, true positive (TP) is the number of positive-class instances that have been correctly classified, true negative (TN) is the number of negative-class instances that have been correctly classified, false positive (FP) is the number of negative instances that have been incorrectly classified as positive, and false negative (FN) is the number of positive instances that have been incorrectly classified as negative.
To select the classification index of the positive class, we directly selected the recall (recall = TP/(TP + FN)) and specificity (specificity = TN/(TN + FP)), ensuring both a recall rate of 1 and high specificity.
To balance the effects of recall and specificity on the classification results, an evaluation index, g-means, can be constructed as the geometric mean of recall and specificity: g-means = sqrt(recall × specificity). Although the number of positive samples may be very small, the omission and misjudgment of positive samples are weighted fully by this index: even if the overall classification accuracy is excellent, the g-means value may be very low.
The g-means indicator is effective for the classification evaluation of imbalanced data; however, it attaches equal importance to recall and specificity. As more attention was paid to recall in the present study, g-means was extended to recall * g-means, which increases the emphasis on recall. Ultimately, the proposed method performed best under this recall-weighted index.
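Given the confusion-matrix counts, the four indices used in this paper reduce to a few lines (`imbalance_metrics` is an illustrative helper name, not from the paper):

```python
import math

def imbalance_metrics(tp, fn, tn, fp):
    """Recall, specificity, g-means, and the recall-weighted g-means
    used as evaluation indices in the text."""
    recall = tp / (tp + fn)            # sensitivity of the positive class
    specificity = tn / (tn + fp)       # accuracy on the negative class
    g_means = math.sqrt(recall * specificity)
    return recall, specificity, g_means, recall * g_means
```

For the all-negative classifier on a 99:1 dataset (tp = 0), recall and hence g-means are 0 even though accuracy is 99%, which is exactly the failure mode g-means is designed to expose.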

Experiments.
We used ten class-imbalanced datasets with various positive ratios and compared our algorithm with the B-SVM [37], cost-sensitive support vector machine (CS-SVM) [44], and BP neural network algorithms [45], as well as two special classification costs. These experiments proved that it is feasible to adjust the positive-class margin to achieve a positive recall rate of 1. In Table 2, we compare the classification differences between the proposed method and the other methods under different classification standards. We selected ten real-world imbalanced datasets from the UCI machine learning data repository [46]. Among them, ecoli1, ecoli2, glass6, car1v3, car1v4, glass5, segment1, and glass6 were constructed from multiclass problems.

Experimental Design.
We compared the proposed method with several SVM-based methods and a neural network, including the B-SVM, CS-SVM, and BP approaches, as well as two special classification costs (no cost and prorated cost). For the prorated cost, we set the misclassification cost based on the following equation: misclassification cost(positive class) / misclassification cost(negative class) = total number of negative instances / total number of positive instances. The RBF Gaussian kernel K(x, z) = exp(−‖x − z‖²/(2σ²)) was used for all SVM algorithms, where σ is the bandwidth parameter that must be predetermined. We selected the best σ value through a grid search over the range 0.01 to 20. For different datasets, while searching for parameters such as C and k, different grid widths were selected to accelerate parameter selection. For both the positive- and negative-class costs, the range [0.01, 100] was selected, and the search range of k was fixed at [0.01, 10]. For the BP network, the logistic function was selected as the activation function, the learning rate was selected within the range [0.000001, 1], the number of hidden layers was selected from {1, 2, 3}, and the search range of the number of neurons in each hidden layer was [2, 1000]. Five-fold cross-validation was used in each set of parameter experiments to determine the set of parameters with the best generalization performance. We chose the recall, specificity, g-means, and area under the receiver operating characteristic curve (AuROC) instead of global evaluation indicators to evaluate the performance of the classification methods on imbalanced datasets.
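The model-selection rule implied by the grid search — keep only parameter settings that achieve a recall of 1, then take the one with the highest specificity — can be sketched as a post-processing step. The `(params, recall, specificity)` tuple layout is an assumption for illustration:

```python
def select_model(results):
    """Among (params, recall, specificity) tuples from a grid search,
    keep only models with recall == 1 and return the one with the
    highest specificity; return None if no model reaches recall 1."""
    feasible = [r for r in results if r[1] == 1.0]
    if not feasible:
        return None
    return max(feasible, key=lambda r: r[2])
```

In practice each tuple would come from a cross-validated fit at one grid point of (C+, C−, k, σ).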
No cost and prorated cost were classified using a judgment threshold of 0, whereas the B-SVM, CS-SVM, and BP approaches used AuROC as the classification standard for the recall, specificity, and g-means.

ALGORITHM 1: Algorithm to increase the recall rate to 1.
Set the 3D grid search range of C_p, C_n, k. For each grid point:
(a) Select the optimization variables α_1^(n) and α_2^(n) and solve the optimization problem using the sequential minimal optimization (SMO) algorithm to obtain α_1^(n+1) and α_2^(n+1), and update α to α^(n+1).
(b) If the KKT conditions in (16)-(19) are satisfied within the allowable range of precision ε, proceed to the next step; otherwise, continue with step (a).
(c) Once α* = (α_1*, …, α_N*)^T is obtained, calculate w* and b* from the support vectors.
Return G_t.
End

Results and Discussion
To demonstrate visually the ability of the proposed method to adjust the classification boundary, we selected two sets of data (yeast5 and breast cancer d) and projected them into a two-dimensional space for visualization. First, to demonstrate the influence of the value of k on the change in the decision boundary for the two datasets, we used different k values, as shown in Figure 3. In Figure 3, from left to right, the values of k are k = 1, k = 3, and k = 2.416 and k = 1, k = 3, and k = 2, respectively. The values were selected after a broad survey, and the recall for the adjusted k value of both datasets is 1. For the breast cancer d dataset, the classification effect of the adjusted k value is clearly better than that of the two randomly selected k values; hence, we do not discuss this dataset in detail. However, for the yeast5 dataset, the other two k values also reach a recall of 1. The preliminary observation is that the classification effect when k = 1 is similar to that when k = 2; therefore, we compared their respective g-means values. When k = 1, g-means = 0.936; when k = 3, g-means = 0.900; and when k = 2, g-means = 0.964. The comparison revealed that the adjusted k value not only increased the recall to 1 but also did not reduce the g-means value. Therefore, changing the margin of the positive class can adjust the position of the classification boundary so that the positive class is properly separated by the boundary.
In addition, the receiver operating characteristic (ROC) curve, through which a threshold is determined, is a widely used evaluation index for classification problems. For the classification problem, we obtained a set of predicted values and classified the data by traversing the predicted values, using each predicted value as a threshold. Predicted values less than the threshold are classified as negative, and predicted values greater than the threshold are classified as positive.
Therefore, for each threshold, we can determine a unique pair of true positive rate (TPR) and false positive rate (FPR). The threshold determination criterion of the ROC curve is to maximize TPR − FPR; the threshold at which TPR − FPR is largest is taken as the ROC threshold.
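The max(TPR − FPR) criterion (Youden's J statistic) can be computed with a plain scan over the candidate thresholds; the exhaustive loop below is a simplification of the usual sorted-sweep implementation:

```python
import numpy as np

def roc_threshold(scores, labels):
    """Pick the score threshold maximizing TPR - FPR (Youden's J),
    the ROC criterion described in the text. Labels are +1 / -1."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels)
    P = (labels == 1).sum()                 # total positives
    N = (labels == -1).sum()                # total negatives
    best_t, best_j = None, -np.inf
    for t in np.unique(scores):             # each score is a candidate threshold
        pred = scores >= t                  # >= threshold -> classified positive
        tpr = (pred & (labels == 1)).sum() / P
        fpr = (pred & (labels == -1)).sum() / N
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_t, best_j
```

As the text notes for Figure 4, a threshold chosen this way often fails to reach a recall of 1, which is why the proposed method adjusts k instead.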
Frequently, the classification threshold under the ROC criterion cannot increase the recall to 1, as in the experiments on the four datasets whose results are presented in Figure 4. The experiments prove that the threshold converges to a certain value as k increases. As shown in Figure 4, the classification ROC threshold of the four datasets decreases gradually with increasing k and exhibits an obvious trend of asymptotic stability. Based on the experimental results, an appropriate threshold can be chosen to ensure that the recall requirement of 1 is achieved by adjusting k. We completed five-fold cross-validation experiments to reduce randomness, and the results indicate that our scheme is universal and widely applicable. Table 3 summarizes the recall and specificity obtained with the different methods, and Tables 4-6 compare the evaluation indices. The recall rates are listed in Table 4: in the datasets shown, the average recall rate of the proposed method is the largest, effectively increasing the recall rate to 1 in each case. The g-means index is shown in Table 5. It can be observed from Table 5 that although the average g-means value of our method is not the highest, it is not reduced much compared with the other three ideal classification cases. Moreover, the g-means value of our proposed method is higher than that of the other five cases for some datasets. It can be observed that when the recall rate of the positive class reaches 1, the accuracy of the negative class does not decrease significantly and remains within the acceptable range. These results demonstrate that the proposed method outperforms the other methods for the positive class of imbalanced datasets and that there may be a certain degree of improvement in the g-means index. Figure 5 shows the combined g-means indicators obtained using the several methods.
Because we consider the recall rate to be extremely important, g-means multiplied by recall is used as the new indicator, which increases the weight of recall within g-means so that the new indicator places weighted emphasis on recall. The g-means indices that attach more importance to recall are presented in Table 6. When the new indicator is adopted, our proposed method has the highest average value. Figure 6 clearly depicts the recall * g-means evaluation indicator under the comprehensive influence of the classification effects of the positive and negative classes. These results verify that the proposed method performs well on all the datasets.
For the statistical analysis, we implemented Student's t-test to verify whether significant differences exist between the proposed method and the other methods in the experiment. The t-value in Student's t-test is calculated as

t = (x̄_1 − x̄_2) / sqrt(σ_1²/n_1 + σ_2²/n_2),

where x̄ represents the sample mean of the data, σ is the sample standard deviation, and n is the sample size. In this case, the sample size is set to 10. As a case study, we compared the proposed method with the other methods, calculating the t-value using the recall and g-means data listed in Tables 4 and 5. The null hypothesis is H_0: x̄_1 = x̄_2, and the alternative hypothesis is H_1: x̄_1 ≠ x̄_2, where x̄_1 is the sample mean obtained by the proposed method and x̄_2 is the sample mean of each of the three methods considered for the comparison. The same holds for the g-means test. Three t-tests were conducted for the three models listed in Table 4, and the results were compared. For the recall of the BP, CS-SVM, and B-SVM models, the t-values obtained are 2.258, 2.631, and 2.671, respectively, all of which exceed the critical t-value of 1.813 at the probability threshold of 0.05 under Student's t-distribution; thus, the null hypothesis is rejected for recall. For the g-means of the BP, CS-SVM, and B-SVM models, the t-values obtained are 1.320, −0.689, and −0.483, respectively. These calculated t-values are all lower than the critical value of 1.813; thus, the null hypothesis cannot be rejected. Therefore, at the 0.05 level of significance, we believe that no significant difference exists among the g-means indicators obtained by the several models.
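The t-value computation described above can be reproduced with the unpooled two-sample formula; this form is our reading of the (missing) extracted equation, so it is an assumption:

```python
import math

def t_statistic(a, b):
    """Two-sample t statistic t = (mean_a - mean_b) /
    sqrt(s_a^2/n_a + s_b^2/n_b), with unpooled sample variances."""
    def mean(x):
        return sum(x) / len(x)
    def var(x):
        m = mean(x)
        # unbiased sample variance (n - 1 in the denominator)
        return sum((v - m) ** 2 for v in x) / (len(x) - 1)
    return (mean(a) - mean(b)) / math.sqrt(var(a) / len(a) + var(b) / len(b))
```

With the ten per-dataset recall values of the proposed method as `a` and those of a baseline as `b`, the resulting t-value would then be compared against the 0.05-level critical value used in the text.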
Based on the above conclusions, at the 0.05 level of significance, the recall of the proposed method is significantly greater than that of other methods, and no significant difference exists in the g-means.

Conclusions
In this paper, a cost-sensitive SVM algorithm based on an imbalanced margin was proposed for the classification of imbalanced data.
This method was based on the theory proposed by Veropoulos et al. [37], and its feasibility was verified using both theoretical and experimental results. The recall rate of the minority class was improved by adjusting the SVM positive classification margin. The proposed method was also compared with other traditional methods. The experimental results demonstrate that a minority-class recall rate of 1 can be achieved using the proposed method. However, the proposed approach still has some disadvantages. Specifically, some accuracy on the negative class is lost in some datasets, although in many cases the performance improves compared with that of the traditional methods. When the classification evaluation criteria are changed (i.e., more emphasis is placed on the positive class), the average evaluation index of the proposed method is the highest. Such classification results are of great significance in the fields of finance, medicine, engineering, and astronomy, to name a few. In future work, we will test the experimental setup employed in this study using different machine learning models and attempt to apply this method to practical problems. Additionally, we will extend the proposed method to multiclass classification problems by adopting a one-versus-all approach.
Data Availability

The datasets used to support the findings of this study have been deposited in the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets).

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.