A Novel Oversampling Method Based on SMOTE and Local Sets for Imbalanced Classification

Learning a classifier from imbalanced data remains a challenging issue. Oversampling methods can improve imbalanced classification from the perspective of data preprocessing, and many such methods have been proposed. Nevertheless, most tend to generate unnecessary noise, create redundant synthetic samples in the class center, and rely heavily on the neighborhood parameter k. To address these issues, this work presents an oversampling method based on local sets and SMOTE (LS-SMOTE). First, the local sets are searched to describe the local characteristics of the imbalanced data. Second, a local-set-based noise filter is designed to remove noise and smooth the class boundary. Finally, on each local set, the SMOTE interpolation between a base sample and a selected sample closest to the majority class is employed to create the synthetic samples. Experimental results on 12 real data sets show that LS-SMOTE outperforms representative oversampling methods in training the k nearest neighbor classifier.


Introduction
Supervised classification has been widely applied in data mining, artificial intelligence, machine learning, etc. Traditional classifiers are trained on data sets with a balanced class distribution. However, the data distribution is often skewed in many practical situations, such as image processing, gene recognition and natural language processing. In skewed data, the number of samples in the positive case (minority class) is much smaller than in the negative case (majority class). Hence, the minority class is difficult to classify correctly but is often the most interesting from the perspective of the application. To improve imbalanced classification, oversampling methods [1] have been developed and have recently attracted considerable attention.
Oversampling methods improve imbalanced classification by creating synthetic samples to extend the minority class. Resampling [2] is the classical oversampling method. It amplifies the minority class by copying original samples until the class distribution is balanced. Nevertheless, resampling causes over-fitting because of the many duplicate samples. To improve generalization, the synthetic minority over-sampling technique (SMOTE) [3] creates synthetic examples by interpolating among the k nearest minority-class neighbors.
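To make the interpolation concrete, here is a minimal NumPy sketch of SMOTE's core step; the function name and parameters are illustrative, not the original implementation:

```python
import numpy as np

def smote(X_min, k=5, n_new=100, seed=0):
    """Minimal SMOTE sketch: pick a random minority sample and
    interpolate toward one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    # indices of the k nearest neighbors (column 0 is the sample itself)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))             # random base sample
        j = nn[i, rng.integers(k)]               # one of its k neighbors
        gap = rng.uniform(0, 1)                  # random interpolation factor
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```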
So far, SMOTE has received extensive attention, and it has many practical applications, such as oil and gas recognition, pressure risk evaluation for oil and gas, and geological hazard evaluation [4][5][6]. Building on the idea of SMOTE, many oversampling methods have been developed recently, including Borderline-SMOTE [7], ADASYN [8], Safe-Level-SMOTE [9], k-means SMOTE [10], RSMOTE [11], etc. Despite their effectiveness, most have been shown to have the following limitations: (a) Most oversampling methods tend to generate unnecessary noise, because the interpolation between noisy and/or unsafe borderline samples is used to generate synthetic samples, which amplifies the effect of noise.
(b) Most oversampling methods may create redundant synthetic samples in the class center. Although samples in the class center can maintain the basic distribution of the data, they contribute little to classification performance.
(c) Almost all oversampling methods rely heavily on the neighborhood parameter k, because they use the interpolation among the k nearest neighbors to create synthetic samples. It is difficult to choose an appropriate value of k for different data distributions.
To the best of our knowledge, no existing oversampling method overcomes all of these challenges simultaneously.
To overcome these issues, this work presents an oversampling method based on local sets and SMOTE (LS-SMOTE). First, the local sets are searched to describe the local characteristics of the imbalanced data. Second, a local-set-based noise filter is designed to remove noise and smooth the class boundary. Finally, on each local set, the SMOTE interpolation between a base sample and a selected sample closest to the majority class is used to generate synthetic minority-class samples. In the experiments, the k nearest neighbors classifier (KNN) is employed to validate LS-SMOTE against 5 oversampling methods on 12 real data sets.

Related Work
SMOTE is an important oversampling method proposed by Chawla et al. [3]. SMOTE creates synthetic data by interpolating among the k nearest neighbors. However, SMOTE easily generates noise because it may interpolate among noisy or unsafe borderline samples. Hence, SMOTE-TL and SMOTE-ENN [12] were proposed, which use Tomek Links (TL) and the Edited Nearest Neighbor rule (ENN) to filter out noise after SMOTE. Additionally, the Iterative Partitioning Filter (IPF) [13] is used to handle noise in SMOTE-IPF.
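For reference, the imbalanced-learn library ships combined resamplers corresponding to SMOTE-ENN and SMOTE-TL; a usage sketch on a toy imbalanced data set, assuming imbalanced-learn is installed:

```python
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN, SMOTETomek

# toy imbalanced data set (10% minority class)
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# SMOTE followed by Edited Nearest Neighbours cleaning (SMOTE-ENN)
X_enn, y_enn = SMOTEENN(random_state=0).fit_resample(X, y)

# SMOTE followed by Tomek-link removal (SMOTE-TL)
X_tl, y_tl = SMOTETomek(random_state=0).fit_resample(X, y)
```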
Borderline-SMOTE [7] and ADASYN [8] are two improvements of SMOTE. They use the k nearest neighbors to create synthetic samples on the class boundary. However, Borderline-SMOTE and ADASYN are susceptible to noise and rely heavily on the parameter k.
Safe-Level-SMOTE [9] and RSMOTE [11] create synthetic minority-class samples in the class center, which strengthens the distributional characteristics of the original data. Specifically, Safe-Level-SMOTE divides the minority class into a noisy region, a borderline region and a safe region, and then creates synthetic minority-class samples only in the safe region. RSMOTE introduces relative density and the 2-means clustering algorithm to generate synthetic samples in the safe areas. However, Safe-Level-SMOTE and RSMOTE may create redundant synthetic samples in the class center, which contribute little to classification performance. Besides, they are also susceptible to noise and rely heavily on the parameter k.
Douzas et al. [10] proposed k-means SMOTE, which employs the k-means clustering algorithm to divide the entire data set into k parts. Next, it calculates the sparsity of the k parts and creates more synthetic samples in the sparser regions. However, k-means SMOTE relies on 5 parameters (including the parameter k) and cannot detect noise in the original data.
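These SMOTE variants are also available in imbalanced-learn; the sketch below mainly illustrates how each one exposes the neighborhood size it depends on (KMeansSMOTE may need its clustering settings tuned on small data sets):

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE, ADASYN, KMeansSMOTE

# toy imbalanced data set (10% minority class)
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# each variant exposes the neighborhood parameter discussed above
X_b, y_b = BorderlineSMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
X_a, y_a = ADASYN(n_neighbors=5, random_state=0).fit_resample(X, y)
X_k, y_k = KMeansSMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
```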
While most existing oversampling methods manage to combat one or two of these weaknesses, none has been shown to simultaneously avoid generating noise, avoid creating redundant synthetic samples in the class center, and remove the dependence on the parameter k. To fill this gap, LS-SMOTE is proposed.

Local Sets
The idea of local sets (LS) is based on the nearest enemy (NE). Specifically, the local set of sample xi is the set of samples whose distance to xi is smaller than the distance between xi and its nearest neighbor from a different class. The nearest enemy NE(xi) of xi is defined as follows:
Definition 1. (Nearest Enemy): The nearest enemy of sample xi is its nearest neighbor from a different class, denoted NE(xi).
Based on the definition of NE, the LS and its cardinality LSC are defined as follows:
Definition 2. (Local Set): The local set LS(xi) of xi is the set of samples whose distance d(xi, xj) to xi is smaller than the distance between xi and NE(xi):
LS(xi) = {xj ∈ X : d(xi, xj) < d(xi, NE(xi))}. (1)
Definition 3. (Local Set Cardinality): The local set cardinality LSC(xi) is the number of samples in LS(xi):
LSC(xi) = |LS(xi)|. (2)
In equation (2), |·| denotes the cardinality of a set. Figure 1 shows the local sets on a toy data set with two classes shown in different colors.
(c) The more local sets contain sample xi, the safer xi is. In figure 1, the noisy sample E is contained only in LS(E); the safer sample C is contained in LS(C) and LS(B); the safer sample D is contained in LS(A) and LS(B).
(d) The more samples regard sample xi as their NE, the closer xi is to the other classes. In figure 1, the most remote minority-class sample E has the largest number of samples whose NE is E.
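A minimal NumPy sketch of Definitions 1-3 under the Euclidean distance follows; function and variable names are illustrative:

```python
import numpy as np

def local_sets(X, y):
    """Compute NE(x_i) (Definition 1), LS(x_i) (Definition 2) and
    LSC(x_i) (Definition 3) for every sample, using Euclidean distance."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    n = len(X)
    LS, LSC, NE = [], np.empty(n, dtype=int), np.empty(n, dtype=int)
    for i in range(n):
        enemies = np.where(y != y[i])[0]
        NE[i] = enemies[np.argmin(d[i, enemies])]   # nearest sample of another class
        ls_i = np.where(d[i] < d[i, NE[i]])[0]      # strictly closer than NE(x_i)
        LS.append(ls_i)                             # note: x_i lies in its own LS
        LSC[i] = len(ls_i)
    return LS, LSC, NE
```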

Proposed Algorithm
The distance metric in this paper is the Euclidean distance. Let X = {x1, x2, ..., xnmin, xnmin+1, ..., xn} be a training set with n samples and d attributes. Let Smin = {x1, x2, ..., xnmin} be the set of minority-class samples, where nmin is the number of samples in Smin. Likewise, let Smaj = {xnmin+1, xnmin+2, ..., xn} be the set of majority-class samples, with nmaj samples. Our focus is binary classification. The main ideas of LS-SMOTE comprise three parts. First, the local sets are searched on the imbalanced data. Second, a local-set-based noise filter is designed to remove noise and smooth the class boundary. Third, on each local set, the SMOTE interpolation between a base sample and a selected sample closest to the majority class is used to generate synthetic minority-class samples. After that, the imbalanced data set is extended with the synthetic minority-class samples. Section 4.1 introduces the noise filter based on the local sets, and Section 4.2 introduces the oversampling technique with local sets and SMOTE.

Noise Filter Based on Local Sets
As the analysis in Section 3 shows, the more local sets contain sample xi, the safer xi is. Additionally, the closer xi is to the other classes, the more samples regard xi as their NE. Based on these observations, the usefulness and harmfulness of a sample xi are defined as follows:
Definition 4. (Usefulness of sample xi): The usefulness U(xi) of sample xi is the number of local sets that contain xi:
U(xi) = |{xj ∈ X : xi ∈ LS(xj)}|. (3)
Definition 5. (Harmfulness of sample xi): The harmfulness H(xi) of sample xi is the number of samples xj whose NE(xj) is xi:
H(xi) = |{xj ∈ X : NE(xj) = xi}|. (4)
According to equations (3)-(4), the usefulness indicates how safe sample xi is, while the harmfulness indicates how anomalous it is. Hence, a rule based on the usefulness and harmfulness (formula (5)) is used to remove noise and smooth the class boundary.
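A sketch of the filter, reusing local_sets from the previous sketch; since formula (5) is not reproduced here, the removal rule below (discard xi when its harmfulness exceeds its usefulness) is an assumption consistent with the surrounding discussion, not necessarily the paper's exact rule:

```python
import numpy as np

def filter_noise(X, y):
    """Local-set noise filter sketch. Usefulness (formula (3)) counts the
    local sets that contain x_i; harmfulness (formula (4)) counts the
    samples whose nearest enemy is x_i. Dropping x_i when harmfulness
    exceeds usefulness is an ASSUMED reading of formula (5)."""
    LS, LSC, NE = local_sets(X, y)        # sketch from the Local Sets section
    n = len(X)
    usefulness = np.zeros(n, dtype=int)
    harmfulness = np.zeros(n, dtype=int)
    for j in range(n):
        usefulness[LS[j]] += 1            # x_i in LS(x_j) raises U(x_i)
        harmfulness[NE[j]] += 1           # NE(x_j) = x_i raises H(x_i)
    keep = harmfulness <= usefulness      # assumed form of formula (5)
    return X[keep], y[keep]
```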

Oversampling Technique with Local Sets and SMOTE
The pseudo-code of LS-SMOTE is described in Algorithm 1. LS-SMOTE requires only one parameter, N, which is the number of synthetic samples created for each minority-class sample. First, the local set of each sample in X is found by formulas (1)-(2), which takes O(n^2) time. Formula (5) is then used to remove noise and smooth the class boundary on Line 3, also in O(n^2) time. Next, the local sets are searched again on Line 4, since noise can affect the local sets. Lines 5-18 carry out the interpolation. Concretely, a base sample (Base) is selected on Line 6. In the local set of the base sample, the farthest sample (SelectedSample), which has the smallest LSC, is selected on Line 9. On Lines 10-13, the difference of each attribute between Base and SelectedSample is calculated and scaled by rand(0, 1), a random number in the range [0, 1]; on Line 12, the scaled difference is used to synthesize a new sample. Finally, the set of synthetic samples (Synthetic) is returned on Line 18. From this analysis, the time complexity of LS-SMOTE is O(n^2).
Figure 2 shows the result of applying LS-SMOTE to an artificial data set. As figure 2 shows, LS-SMOTE removes the noise in the original data and avoids generating new noise. Besides, LS-SMOTE creates more synthetic borderline samples, which avoids generating redundant samples in the class center and benefits classification performance.
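Putting the pieces together, a sketch of Algorithm 1 that reuses local_sets and filter_noise from the earlier sketches; the choice of the LS member with the smallest LSC follows the description above, while the remaining details (e.g., handling of isolated samples) are assumptions:

```python
import numpy as np

def ls_smote(X, y, minority_label, N=2, seed=0):
    """LS-SMOTE sketch following Algorithm 1: filter noise, recompute the
    local sets, then for each minority base sample interpolate N times
    toward the member of LS(Base) with the smallest LSC, i.e. the sample
    closest to the majority class."""
    rng = np.random.default_rng(seed)
    X, y = filter_noise(X, y)                   # Line 3: remove noise
    LS, LSC, _ = local_sets(X, y)               # Line 4: recompute local sets
    synthetic = []
    for i in np.where(y == minority_label)[0]:  # Line 6: each base sample
        ls_i = LS[i][LS[i] != i]                # LS(Base) without Base itself
        if len(ls_i) == 0:                      # isolated sample: nothing to select
            continue
        j = ls_i[np.argmin(LSC[ls_i])]          # Line 9: smallest LSC in LS(Base)
        for _ in range(N):                      # create N synthetic samples
            gap = rng.uniform(0, 1)             # rand(0, 1) scaling factor
            synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.asarray(synthetic)
```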

Experiments
We select the experimental data sets from the UCI (University of California, Irvine) repository. These are real data sets from various fields. Table 1 describes the adopted data sets in terms of 8 aspects, such as the number of samples, the number of attributes, the number of minority-class samples, and the number of majority-class samples.
Multiclass data sets are binarized using the one-versus-rest method: the smallest class is regarded as the minority class, and the remaining data are marked as the majority class. Additionally, we use 10-fold cross-validation to divide each data set into training and test parts, and the experimental results are obtained by repeating the 10-fold cross-validation 10 times. The KNN classifier with k = 3 is adopted to validate the comparison methods. F-measure with β = 1 and G-mean are used as the evaluation metrics.
F-measure and G-mean are defined in terms of positive and negative cases. In imbalanced classification, the positive case refers to the minority class, while the negative case refers to the majority class. A larger G-mean indicates that the classifier performs equally well on the positive and negative cases, while a larger F-measure indicates higher accuracy on the positive case.
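The protocol above can be reproduced with scikit-learn and imbalanced-learn; a sketch on a toy data set, where the oversampling step is omitted and would be applied to each training fold:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
from imblearn.metrics import geometric_mean_score

# toy imbalanced data set standing in for a UCI data set
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
f1s, gmeans = [], []
for train, test in cv.split(X, y):
    # an oversampler (e.g., the ls_smote sketch above) would extend X[train] here
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train], y[train])
    pred = clf.predict(X[test])
    f1s.append(f1_score(y[test], pred))                 # F-measure with beta = 1
    gmeans.append(geometric_mean_score(y[test], pred))  # sqrt(TPR * TNR)

print(f"F1 = {np.mean(f1s):.3f}, G-mean = {np.mean(gmeans):.3f}")
```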

Experiments on Real Data Sets
We validate LS-SMOTE by comparing it with 5 methods: SMOTE, Safe-Level-SMOTE, Borderline-SMOTE, ADASYN and k-means SMOTE, which Section 2 describes in detail. The parameters of the comparative methods are set as in their standard versions, and the parameter N is set to 2 in LS-SMOTE. Tables 2-3 show the results in F-measure and G-mean.
In table 2, LS-SMOTE achieves the highest F-measure on 8 of the 12 data sets. In table 3, LS-SMOTE achieves the highest G-mean on 10 of the 12 data sets. Meanwhile, LS-SMOTE outperforms the comparative methods in average F-measure and G-mean, reported in the row labeled "Average". These results show that LS-SMOTE outperforms the comparison methods in improving the minority class. We also use the non-parametric two-sided Wilcoxon signed-rank test, at a significance level of 0.05, to assess the differences between each comparative method and LS-SMOTE. The symbols +, - and ~ in the row labeled "Wilcoxon" indicate that LS-SMOTE is significantly better than, significantly worse than, or statistically equivalent to the compared method. As that row shows, LS-SMOTE is significantly better than the comparison methods in tables 2-3.
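The test itself is available in SciPy; a sketch with placeholder score vectors (the numbers below are illustrative only, not the paper's results):

```python
from scipy.stats import wilcoxon

# one score per data set; placeholder values, NOT the paper's results
ls_smote_scores = [0.91, 0.88, 0.93, 0.85, 0.90, 0.87, 0.92, 0.89, 0.86, 0.94, 0.90, 0.88]
smote_scores    = [0.89, 0.87, 0.90, 0.83, 0.88, 0.86, 0.90, 0.88, 0.85, 0.92, 0.88, 0.86]

stat, p = wilcoxon(ls_smote_scores, smote_scores)  # two-sided by default
print(f"statistic={stat}, p={p:.4f}, significant at 0.05: {p < 0.05}")
```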
Overall, the experiments show that LS-SMOTE outperforms the comparative methods in improving KNN.

Conclusions
This work aims to solve three difficult issues: the generation of unnecessary noise, the creation of redundant synthetic samples in the class center, and the choice of the parameter k. Hence, an oversampling method based on local sets and SMOTE (LS-SMOTE) is proposed. First, the local sets are searched to describe the local characteristics of the imbalanced data. Second, a local-set-based noise filter is designed to remove noise and smooth the class boundary. Finally, on each local set, the SMOTE interpolation between a base sample and a sample closest to the majority class is used to generate synthetic minority-class samples.
In the experiments, the performance of LS-SMOTE is validated by comparing it with 5 popular oversampling methods on 12 real data sets. The experiments show that LS-SMOTE outperforms the comparative methods in improving the positive case (i.e., the minority class), in terms of both F-measure and G-mean, with the KNN classifier.