The Effect of Resampling on Classifier Performance: an Empirical Study

ABSTRACT

The resampling technique includes several methods, such as random undersampling, random oversampling, and SMOTE [4]. Undersampling resamples by reducing the data in the majority class. This technique is effective in overcoming class imbalance because much of the majority-class data is discarded, so the dataset becomes more balanced and the training process becomes faster [5]. Oversampling is a resampling technique that adds data to the minority class. Oversampling can add necessary information to minority classes and prevent misclassification [6]. This study uses random undersampling, random oversampling, and SMOTE as resampling techniques to overcome class imbalance in the datasets used.
Related research on resampling showed that random undersampling significantly improves the classification performance of imbalanced classes in Medicare Big Data [7], achieving an AUC score of 97%. Other related research reported that random oversampling and SMOTE outperformed the other resampling techniques [8]. Majority-to-Minority Resampling (MMR), a hybrid approach to picking switched instances, adaptively selects potential instances from the majority class to enhance the minority class; the proposed approach outperforms several strong baselines across standard metrics for imbalanced data [9]. Similarity Oversampling and Undersampling Preprocessing (SOUP), which resamples difficult cases, outperforms specialized preprocessing methods for multi-imbalance problems and competes with the best-known decomposition ensembles on natural and artificial datasets [10]. Borderline-SMOTE, Random Over Sampler, SMOTE, SMOTE-ENN, SVM-SMOTE, and SMOTE-Tomek have been used to handle imbalanced data and predict pupil success on two datasets with machine-learning models such as Random Forest, K-Nearest Neighbor, Artificial Neural Network, XGBoost, Support Vector Machine (Radial Basis Function), Decision Tree, Logistic Regression, and Naïve Bayes [11]. SVM-SMOTE outperformed the other resampling methods in the Friedman statistical relevance test, and Random Forest performed best after SVM-SMOTE resampling.
This study's motivation and new contribution lie in evaluating the resampling algorithm with three different classifiers: Naïve Bayes Classifier, Decision Tree, and Backpropagation Neural Network (BPNN) on 100 public datasets. While the resampling algorithm has been widely used in the literature, its effectiveness in improving classification performance on imbalanced datasets is still an open question, particularly when combined with different classifiers. Furthermore, this study goes beyond simply applying the resampling algorithm and evaluates its impact on classification metrics, including accuracy, precision, recall, and f-measure, which are particularly relevant in resampling applications on classifiers where false negatives and false positives can have significant consequences.
Overall, this study aims to determine the significance of resampling on classifier performance by comparing the classifier's performance on resampled datasets with its performance on the same datasets without resampling. To achieve that goal, an empirical study was carried out on 100 datasets. Each dataset was resampled with three techniques: random undersampling, random oversampling, and SMOTE. The dataset without resampling and the resampled datasets were then classified using three different classifiers. The classification metric values were tested using paired t-tests to find the significance of resampling and to identify the combination of classifier and resampling that provides the most significant results. These findings can inform the development of more effective and reliable machine-learning models for resampling applications on classifier performance. Section II of this article explains the methodologies used in this research. Section III presents the findings, together with a discussion of those results and a comparison of relevant models. The conclusion is given in Section IV.

A. Data Collection
This study used 100 datasets obtained online from three websites: UCI Machine Learning, Kaggle, and OpenML. Each dataset has numeric inputs and a binary class. Most datasets have more than ten attributes and fewer than 1000 instances. The fields used in the datasets can be seen in Figure 1.

B. Data Preprocessing
Two preprocessing steps are carried out in this phase: imputing missing values and resampling. Missing-value imputation aims to replace the missing values in the dataset with new sample values. In this study, the method used to create new samples is K-Nearest Neighbor (K-NN) with k = 10 neighbors and the Manhattan distance. This method looks for cases that are similar to the cases with missing values; two cases are similar if each attribute of the two cases is close together. When a similar case has been found, the attribute with missing values is filled with the value of the corresponding attribute in the similar case. The K-NN algorithm can provide more robust and more sensitive predictions of missing values [12]. Using k = 10 neighbors to handle missing values can minimize the error rate in the subsequent classification [13], and the Manhattan distance can give better results than other distances (Euclidean, Correlation, and Cosine distance) [14].
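To make this imputation step concrete, below is a minimal Python sketch of K-NN imputation with k = 10 and the Manhattan distance, assuming a purely numeric feature matrix with np.nan marking missing values; the function name fill_missing_knn and the way donor rows with their own missing values are handled are illustrative choices, not details taken from the study.

```python
import numpy as np

def fill_missing_knn(X, k=10):
    """Fill each missing value with the mean of that feature over the k nearest
    rows, where distance is the Manhattan distance over the features the
    incomplete row has observed (missing donor entries contribute zero)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, row in enumerate(X):
        missing = np.isnan(row)
        if not missing.any():
            continue
        observed = ~missing
        # Manhattan distance from row i to every row, over observed features only
        dists = np.nansum(np.abs(X[:, observed] - row[observed]), axis=1)
        dists[i] = np.inf                                   # exclude the row itself
        for j in np.where(missing)[0]:
            donors = np.where(~np.isnan(X[:, j]))[0]        # rows that have feature j
            nearest = donors[np.argsort(dists[donors])][:k] # k closest donors
            if nearest.size:
                filled[i, j] = X[nearest, j].mean()
    return filled
```

For reference, scikit-learn's KNNImputer offers a ready-made alternative, but it defaults to a NaN-aware Euclidean distance, so the manual version above is shown to match the Manhattan-distance setting described here.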
Resampling aims to resample each dataset using random undersampling, random oversampling, and SMOTE with a ratio of 100%. Random undersampling is a technique that removes randomly selected data to decrease the majority class [15]. Random oversampling is an oversampling technique that duplicates randomly selected data to increase the minority class [16]. SMOTE is an oversampling technique that creates new synthetic data from some of the closest selected data using K-NN [17]. A ratio of 100% means that new samples are generated until the minority class has the same number of instances as the majority class.
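A minimal sketch of the three resampling set-ups, assuming the imbalanced-learn library (the study does not state which implementation was used); X and y stand for the feature matrix and labels of one dataset, and the random_state values are illustrative. A sampling_strategy of 1.0 corresponds to the 100% ratio, i.e. the minority class ends up the same size as the majority class.

```python
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE

# sampling_strategy=1.0: resample until the two classes are balanced (100% ratio)
samplers = {
    "random_undersampling": RandomUnderSampler(sampling_strategy=1.0, random_state=42),
    "random_oversampling":  RandomOverSampler(sampling_strategy=1.0, random_state=42),
    "smote":                SMOTE(sampling_strategy=1.0, random_state=42),
}

# X, y: numeric feature matrix and binary labels of one dataset (assumed defined)
resampled = {name: sampler.fit_resample(X, y) for name, sampler in samplers.items()}
```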

C. Data Classification
Three classifiers are used in this study: Gaussian Naïve Bayes, Decision Tree with the C4.5 Algorithm, and BPNN. Gaussian Naïve Bayes is a kind of Naïve Bayes Classifier that uses the Gaussian (normal) distribution to calculate the probability of each attribute. The Naïve Bayes Classifier is based on Bayes' Theorem with an assumption of independence among features [18]. The normal distribution formula is shown in (1).
$$f(x_i) = \frac{1}{\sigma_{ij}\sqrt{2\pi}}\exp\!\left(-\frac{(x_i-\mu_{ij})^2}{2\sigma_{ij}^2}\right) \quad (1)$$

Where $f(x_i)$ is the normal distribution of each attribute in each class, $\sigma_{ij}$ is the standard deviation of each attribute in each class, $\mu_{ij}$ is the mean value of each attribute in each class, and $x_i$ is the sample value of each attribute. Here is the pseudocode of Gaussian Naïve Bayes:
a. For each class j:
   i. Calculate the likelihood P(x|y=j) using a Gaussian probability density function with mean μij and standard deviation σij for each feature i.
   ii. Calculate the posterior probability P(y=j|x) using Bayes' theorem: P(y=j|x) = P(x|y=j) * P(y=j) / P(x).
b. Choose the class with the highest posterior probability as the predicted class for x.

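A minimal NumPy sketch of the same steps, assuming numeric feature arrays X_train/X_test and binary labels y_train; the function name gaussian_nb_predict and the eps smoothing term are illustrative, not details from the paper.

```python
import numpy as np

def gaussian_nb_predict(X_train, y_train, X_test, eps=1e-9):
    classes = np.unique(y_train)
    priors, means, stds = {}, {}, {}
    for c in classes:
        Xc = X_train[y_train == c]
        priors[c] = len(Xc) / len(X_train)   # P(y = c)
        means[c] = Xc.mean(axis=0)           # mu_ij per feature
        stds[c] = Xc.std(axis=0) + eps       # sigma_ij per feature (eps avoids /0)
    preds = []
    for x in X_test:
        log_post = {}
        for c in classes:
            # log of the Gaussian likelihood from equation (1), summed over features
            log_lik = -0.5 * np.log(2 * np.pi * stds[c] ** 2) \
                      - (x - means[c]) ** 2 / (2 * stds[c] ** 2)
            log_post[c] = np.log(priors[c]) + log_lik.sum()
        preds.append(max(log_post, key=log_post.get))  # highest posterior wins
    return np.array(preds)
```

In practice, scikit-learn's GaussianNB class performs the equivalent calculation.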
The Decision Tree Classifier makes branching conditions based on specific attribute values until no further branching can be done. There are three parts to a Decision Tree: the root node is the main attribute that most influences class determination [19], a branch node is an attribute selected after the root node [20], and a leaf node is the class label of each branch that is passed [21], so the decision tree resembles a tree structure. This study used the C4.5 Algorithm to determine the branches. The C4.5 Algorithm is a development of the ID3 method and can provide better accuracy than ID3 [22]. The formulas to calculate the Gain Ratio are shown in (2) to (6).
$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}_A(D)} \quad (2)$$
$$\mathrm{Gain}(A) = \mathrm{Entropy}(D) - \mathrm{Entropy}_A(D) \quad (3)$$
$$\mathrm{SplitInfo}_A(D) = -\sum_{v=1}^{V}\frac{|D_v|}{|D|}\log_2\frac{|D_v|}{|D|} \quad (4)$$
$$\mathrm{Entropy}(D) = -\sum_{c=1}^{C} p_c \log_2 p_c \quad (5)$$
$$\mathrm{Entropy}_A(D) = \sum_{v=1}^{V}\frac{|D_v|}{|D|}\,\mathrm{Entropy}(D_v) \quad (6)$$

Where GainRatio(A) is the gain ratio value for each split point, Gain(A) is the information gain value for each split point, SplitInfo_A(D) is the split info value for each split point, Entropy(D) is the overall entropy value in the dataset, Entropy_A(D) is the entropy value for each split point, |D_v| is the number of events at each split point, |D| is the total number of events, C is the number of class label types, v is the split point value, and p_c is the probability value for each class. This is the pseudocode for the Decision Tree with the C4.5 Algorithm:
3. For each candidate feature i:
   • Calculate the information gain IGi = H(D) - Σ P(Xi=v) * H(D|Xi=v), where P(Xi=v) is the proportion of samples with Xi=v in D.
   • Calculate the split information Si = -Σ P(Xi=v) * log2(P(Xi=v)).
   • Calculate the information gain ratio IG_ratioi = IGi / Si.
4. Choose the feature i with the highest information gain ratio IG_ratioi as the splitting feature.
5. Create a decision tree node with feature i and its possible values as children.
6. For each child node j of the current node:
   • Let Dj be the subset of D with Xi=j.
   • If Dj is empty, create a leaf node with the majority class in D.
   • Otherwise, recursively build the subtree rooted at node j using Dj and the remaining features in F - {i}.
7. Return the root node of the decision tree.
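To make equations (2) to (6) concrete, here is a minimal Python sketch that evaluates the gain ratio of a single numeric split point; the threshold-based left/right split and the function names are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def entropy(labels):
    """Equation (5): -sum_c p_c log2 p_c."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gain_ratio(feature_values, labels, threshold):
    """Equations (2)-(6) for one numeric split point (<= threshold vs > threshold)."""
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    weights = np.array([len(left), len(right)]) / len(labels)
    entropy_a = weights[0] * entropy(left) + weights[1] * entropy(right)  # (6)
    gain = entropy(labels) - entropy_a                                    # (3)
    nonzero = weights[weights > 0]
    split_info = float(-np.sum(nonzero * np.log2(nonzero)))               # (4)
    return gain / split_info if split_info > 0 else 0.0                   # (2)
```

C4.5 would evaluate this quantity for every candidate split point and pick the one with the highest gain ratio, as in step 4 of the pseudocode above.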
A Neural Network (NN) is a classification algorithm whose operation resembles the human nervous system: a collection of interconnected neurons performs complex learning repeatedly [23]. An NN contains a collection of inputs and outputs connected by weighted links, and the weights are adjusted during the learning phase so that the network makes correct class predictions from the input. NNs are suitable for applications that require complex learning, although they take a long time to carry out repeated learning [24] and depend on empirically determined parameters and network designs [25]. In the NN model, weighted nodes lie on each path, so the NN can learn to handle noisy datasets. An NN conducts several training passes on each case, so the algorithm has a relatively small error rate and high accuracy [26]. The prediction results are influenced by the learning rate value, the target error, the amount of training data used, and the initial weights [27]. The NN model has three parts: the input layer, the hidden layer, and the output layer. The input layer contains a collection of input nodes that hold the input attribute values. The hidden layer comes after the input layer and contains hidden nodes whose values are the inputs processed with weight and bias values; the weight value states an input's priority, and the bias value is a constant contained in each hidden node. The output layer contains output nodes whose values, processed from several hidden nodes, become the predictive values.
In this study, the learning method in the NN Algorithm is the Backpropagation method. Backpropagation adjusts the weights repeatedly for each tuple, propagating the changes backward from the output layer to the hidden layer. The purpose of this adjustment is to minimize the Mean Squared Error (MSE); a low MSE means the predicted class is close to the actual class. There are two processes in the Backpropagation Neural Network (BPNN): feedforward, which calculates each value of the input attributes forward from the input layer to the output layer, and the backward pass, which calculates the error value between the value in the output layer and the target value. The error value is used to adjust the weights until the error becomes smaller than the target error [28]. The steps of the BPNN are shown in (7) to (11).
Equation (7) is used to calculate the new input value for each unit in the hidden layer and the output layer:

$$z\_in_j = \sum_i w_{ij}\, o_i + w_{bj}\, b_j \quad (7)$$

Where $z\_in_j$ is the new input value for each unit in the hidden layer and output layer, $w_{ij}$ is the weight value from unit i in the previous layer to unit j, $o_i$ is the output value from the previous layer (if the calculation is done for the first time, $o_i$ is the value of the input layer), $w_{bj}$ is the bias weight for each unit, and $b_j$ is the bias value for each unit. Then the new output of each unit in the hidden layer and the output layer is calculated using the sigmoid formula in (8).
$$o_j = f(z\_in_j) = \frac{1}{1 + e^{-z\_in_j}} \quad (8)$$

Where $o_j$ is the output value in unit j and $z\_in_j$ is the input value in unit j. Next, the error value used as a stopping condition is calculated using the MSE formula in (9):

$$MSE = \frac{1}{n}\sum_{k=1}^{n}\left(t_k - o_k\right)^2 \quad (9)$$
Where $t_k$ is the target output value in the dataset, $o_k$ is the output value calculated in the output layer, and n is the number of classes. To perform the backward pass, the first thing to do is to calculate the derivative of the total error with respect to the weight of each unit using the chain rule in (10):

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \times \frac{\partial o_j}{\partial z\_in_j} \times \frac{\partial z\_in_j}{\partial w_{ij}} \quad (10)$$

Where $\partial E/\partial w_{ij}$ is the backward-pass value of the total error with respect to the weight connected to each unit, $\partial E/\partial o_j$ is the backward-pass value of the total error with respect to the output of each unit in the output layer and the hidden layer, $\partial o_j/\partial z\_in_j$ is the backward-pass value of the output with respect to the input of each unit, and $\partial z\_in_j/\partial w_{ij}$ is the backward-pass value of the input of each unit with respect to the weight connected to it. Then the weight is updated using the formula in (11):

$$w_{ij}^{new} = w_{ij}^{old} - \alpha \frac{\partial E}{\partial w_{ij}} \quad (11)$$
Where $w_{ij}^{new}$ is the new weight value of the unit, $w_{ij}^{old}$ is the old weight value, $\partial E/\partial w_{ij}$ is the total error value with respect to the weight in each unit, and $\alpha$ is the learning rate value. This is the pseudocode for the BPNN:
1. Initialize the weights and biases.
2. For each training sample (x, y):
   a. Forward pass:
      i. Calculate the output y' of the neural network for input x by applying the weights and biases to each neuron using the activation function.
      ii. Calculate the error δ for each neuron in the output layer as δj = y'j(1 - y'j)(yj - y'j), where yj is the desired output for neuron j.
      iii. Calculate the error δ for each neuron in the hidden layers using the chain rule: δj = yj(1 - yj) Σ wjk δk, where wjk is the weight from neuron k to neuron j, and δk is the error for neuron k.
   b. Backward pass:
      i. Update the weights and biases of the neural network using the error δ and the learning rate α as follows: for each weight wjk from neuron k to neuron j, wjk = wjk + α δj yk; for each bias bj of neuron j, bj = bj + α δj.
3. Repeat step 2 for a fixed number of epochs or until the error on the validation set stops improving.
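As a concrete illustration of the feedforward and backward-pass steps in (7) to (11), below is a minimal NumPy sketch of a single-hidden-layer BPNN with a sigmoid activation; the layer size, learning rate, number of epochs, and random initialization are illustrative values, not the settings used in this study.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_bpnn(X, y, n_hidden=10, lr=0.1, epochs=500, seed=0):
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W1 = rng.normal(scale=0.1, size=(n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, 1));    b2 = np.zeros(1)
    y = y.reshape(-1, 1).astype(float)
    for _ in range(epochs):
        # feedforward: net inputs and sigmoid outputs, equations (7) and (8)
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # backward pass: error terms and weight updates, equations (10) and (11)
        delta_out = (y - out) * out * (1 - out)       # delta at the output layer
        delta_hid = (delta_out @ W2.T) * h * (1 - h)  # delta at the hidden layer
        W2 += lr * h.T @ delta_out;  b2 += lr * delta_out.sum(axis=0)
        W1 += lr * X.T @ delta_hid;  b1 += lr * delta_hid.sum(axis=0)
    mse = np.mean((y - out) ** 2)                     # equation (9)
    return (W1, b1, W2, b2), mse
```

A library multilayer perceptron would behave similarly, but the explicit version above makes the per-weight update rule in (11) visible.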

D. Evaluation
The first evaluation uses classification metrics, which are performance evaluation metrics calculated from the confusion matrix. A confusion matrix is a table with as many rows and columns as there are classes in the dataset, used to analyze the performance of the classification algorithm and to evaluate the quality of the classifier's performance. The confusion matrix has four components: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). TP is the number of positive-class data classified correctly. TN is the number of negative-class data classified correctly. FP is the number of negative-class data incorrectly predicted as the positive class. FN is the number of positive-class data incorrectly predicted as negative [29]. The four values are used to compute the performance evaluation values of the algorithm: accuracy, precision, recall, and f-measure, as in (12) to (15) [30]:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (12)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (13)$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (14)$$
$$\mathrm{F\text{-}measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (15)$$
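The short sketch below shows how the four metrics in (12) to (15) can be computed in Python, assuming scikit-learn is available; y_true and y_pred are illustrative names for the actual and predicted binary labels of one dataset.

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# y_true, y_pred: actual and predicted binary labels (assumed defined)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
scores = {
    "accuracy":  accuracy_score(y_true, y_pred),   # (TP + TN) / (TP + TN + FP + FN)
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
    "recall":    recall_score(y_true, y_pred),     # TP / (TP + FN)
    "f-measure": f1_score(y_true, y_pred),         # 2PR / (P + R)
}
```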
The second evaluation uses the paired t-test, performed in this study with scipy, a Python library. A paired t-test is carried out by calculating the t value between the compared values using the formula in (16) [31]:

$$t = \frac{\bar{d}}{s_d / \sqrt{n}} \quad (16)$$

Where t is the t-statistic value used to determine the significance between the two values, d is the difference between the two samples, $s_d$ is the standard deviation of the differences, $\bar{d}$ is the mean value of the difference between the two samples, and n is the number of instances. In the paired t-test there are two hypotheses: H0, which means there is no significant difference between the two values being compared, and H1, which means there is a significant difference between the two values being compared. To determine whether H0 or H1 is accepted, an alpha value is required; the alpha value used in this study is 5%. If the p-value is less than 5%, H0 is rejected and H1 is accepted. If the p-value is more than 5%, H0 is accepted and H1 is rejected.
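Since the study uses scipy for the paired t-test, the following minimal sketch shows how the test can be applied; metric_with_resampling and metric_without_resampling are illustrative names for arrays of per-dataset metric values of the two conditions being compared.

```python
from scipy import stats

# Paired t-test over per-dataset metric values (arrays assumed defined)
t_stat, p_value = stats.ttest_rel(metric_with_resampling, metric_without_resampling)

alpha = 0.05
if p_value < alpha:
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}: reject H0 (significant difference)")
else:
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}: fail to reject H0 (no significant difference)")
```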

III. Results and Discussion
There are two results obtained from this study. The first is the performance evaluation of each classifier combination with resampling: accuracy, precision, recall, and f-measure. The second is the paired t-test results for resampling in general and for the combinations of classifier and resampling, based on the accuracy, precision, recall, and f-measure values. The performance evaluation results of each classifier combination with the resampling techniques are shown in Figure 2 and the following figures. Based on the results, the classification without resampling gives the best accuracy for each algorithm compared with the three resampling techniques. SMOTE gives better results based on the recall and f-measure values of the three classification algorithms. Based on the precision value, the use of resampling gives different results for each algorithm: Random Undersampling gives the best precision for the Gaussian Naïve Bayes Algorithm, Random Oversampling gives the best precision for the C4.5 Decision Tree Algorithm, and SMOTE gives the best precision for the BPNN Algorithm.
Combining the BPNN Algorithm with SMOTE provides the best performance. This is because SMOTE provides new samples, so the classification algorithm can learn more data patterns. SMOTE performs better than random undersampling and random oversampling because, in random undersampling, essential data may be lost through the random deletion process, so the classifier cannot recognize more varied patterns, which can decrease classifier performance. Meanwhile, random oversampling duplicates the same data at random, which can lead to overfitting: the classifier appears to perform better because it predicts correctly on the duplicated data, but when classifying new data it misclassifies because it does not recognize new data patterns, which can also decrease classifier performance.
The classification results from the unresampled datasets give better accuracy but lower precision, recall, and f-measure than the three resampling techniques, because classification on unresampled datasets can overfit the majority class: the classification algorithm appears to perform better because it predicts the majority data correctly, but when classifying new data that should belong to the minority class, it misclassifies them into the majority class. This happens because the classification algorithm does not learn the minority data well and is better at recognizing data patterns of the majority class. The impact is a decrease in the performance of the classification algorithm because the FP and FN values become high due to classification errors.
The BPNN gives better performance than Gaussian Naïve Bayes and the Decision Tree with the C4.5 Algorithm. In Gaussian Naïve Bayes, the probability of each attribute in each class is calculated with a Gaussian distribution, so class determination depends heavily on the mean and standard deviation of each attribute; if the mean and standard deviation values are larger in a class, the algorithm is more likely to assign new samples to the class with the higher mean and standard deviation. Meanwhile, in the Decision Tree with the C4.5 Algorithm, an entropy value of 0 may appear while calculating the class counts, so the Decision Tree immediately determines the class based on the attribute that has an entropy of 0 at a split point, causing class determination at the beginning of the branching; class determination is then decided by only one attribute. This can reduce the algorithm's performance because other attributes that could influence class determination are not chosen. In the BPNN, there is a process of calculating the error between the prediction results and the actual class and adjusting the weights and biases, which supports more optimal classification results. The t-test results for resampling in general are shown in Table 1.
To determine the significance of the results, this study used a z value of 1.960 for T-Paired and an α value of 5% for P-Paired as the thresholds for determining the hypothesis. If the test results between two resampling techniques have a T-Paired value less than the z value and a P-Paired value more than α, then the two resampling techniques do not provide significant results. In the paired t-test results above, three scenarios, highlighted in yellow, have a T-Paired value less than the z value and a P-Paired value more than the α value, so those three scenarios did not provide significant results.
The following result is the paired t-test between two combinations of classification algorithms with resampling techniques. The red cells show that the p-value is more than 5%, and the green cells show that the p-value is less than 5%. The following are the abbreviations of the combination names used in the paired t-test results. The paired t-test results between two combinations of classification and resampling algorithms are shown in Figure 5 to Figure 8.

For the paired t-test results based on accuracy values, shown in Figure 5, the combination of the NN Algorithm without resampling gives the most significant results compared with the other 11 combinations, while Gaussian Naïve Bayes with random undersampling does not give good results. Figure 5 shows that 5 out of 9 tests comparing the classification results with and without resampling for each algorithm give significant results.

For the paired t-test results based on precision values, shown in Figure 6, the combination of the NN Algorithm without resampling gives the most significant results compared with the other 11 combinations, while Gaussian Naïve Bayes without resampling does not give good results. Figure 6 shows that 7 out of 9 tests comparing the classification results with and without resampling for each algorithm give significant results.

For the paired t-test results based on recall values, shown in Figure 7, the combination of the NN Algorithm without resampling gives the most significant results compared with the other 11 combinations, while the Decision Tree C4.5 Algorithm without resampling does not give good results. Figure 7 shows that 5 out of 9 tests comparing the classification results with and without resampling for each algorithm give significant results.

For the paired t-test results based on f-measure values, shown in Figure 8, the combination of the NN Algorithm without resampling gives the most significant results compared with the other 11 combinations, while Gaussian Naïve Bayes without resampling does not give good results. Figure 8 shows that 6 out of 9 tests comparing the classification results with and without resampling for each algorithm give significant results.

IV. Conclusion
Based on the results and discussion of the research, it can be concluded that the BPNN with SMOTE performs best based on accuracy, precision, recall, and f-measure; its mean and paired t-test values are better than those of the other 11 combinations of classification algorithms and resampling techniques. Some combinations of classification algorithm and resampling technique do not provide significant results depending on the type of evaluation: (1) based on accuracy, the Gaussian Naïve Bayes with random undersampling does not provide the most significant performance results; (2) based on precision and f-measure, the Gaussian Naïve Bayes without resampling does not provide the most significant performance results; (3) based on recall, the Decision Tree C4.5 Algorithm without resampling does not provide the most significant performance results. Using resampling can provide significant results for the classification algorithm's performance compared with its performance on the dataset without resampling: most of the tests comparing classification results from datasets with resampling and from datasets without resampling give significant results. However, combining multiple resampling techniques may improve classification performance even further. Future research could explore the effectiveness of combining different resampling techniques and its impact on classification performance.