Combining Hybrid Approach Redefinition-Multiclass Imbalance (HAR-MI) and Hybrid Sampling in Handling Multi-Class Imbalance and Overlapping

Abstract: The class imbalance problem in multi-class datasets is more challenging to manage than in two-class datasets, and it becomes more complicated when accompanied by overlapping. One method that has proven reliable in dealing with this problem is the Hybrid Approach Redefinition-Multiclass Imbalance (HAR-MI) method, a hybrid approach that combines sampling and classifier ensembles. In terms of diversity among classifiers, a hybrid approach of this kind gives better results, and HAR-MI provides excellent results in handling multi-class imbalance. HAR-MI uses SMOTE to increase the number of samples in the minority class. However, SMOTE has a weakness: on an extremely imbalanced dataset with a large number of attributes it tends to over-fit. To overcome this over-fitting problem, the Hybrid Sampling method was proposed. HAR-MI is combined with Hybrid Sampling to increase the number of samples in the minority class and, at the same time, reduce the number of noise samples in the majority class. The preprocessing stage of HAR-MI uses the Minimizing Overlapping Selection under Hybrid Sampling (MOSHS) method, and the processing stage uses Different Contribution Sampling. The results are compared with those of Neighbourhood-based under-sampling. Overlapping and classifier performance are measured using the Augmented R-Value, the Matthews Correlation Coefficient (MCC), Precision, Recall, and F-Value. The results show that HAR-MI with Hybrid Sampling gives better results in terms of Augmented R-Value, Precision, Recall, and F-Value.


I. INTRODUCTION
The problem of class imbalance has become one of the most interesting research issues in data mining, machine learning, and knowledge discovery [1], [2]. It occurs because most real-world datasets are imbalanced; if the imbalance is not handled properly, a class with a small number of samples becomes under-represented and accuracy drops [3]. In general, approaches to solving class imbalance problems can be divided into three groups: data-level, algorithm-level, and hybrid [4]. The data-level approach changes the distribution of the data through over-sampling or under-sampling: over-sampling is applied to the minority class and under-sampling to the majority class [5]. The algorithm-level approach, on the other hand, does not change the distribution of the data but makes the classifier pay more attention to minority classes by applying bagging, boosting, or an ensemble of existing classifiers [6].
The Hybrid Approach combines the data-level and algorithm-level approaches [7]. In terms of diversity and classifier performance, a hybrid approach that combines sampling and classifier ensembles gives good results [8]. The hybrid method is good at dealing with both binary-class and multi-class imbalance problems [9]. Multi-class imbalance problems are more difficult to handle than binary-class imbalance, and multi-class imbalance usually does not occur alone but is accompanied by overlapping [10]. The problem becomes even more challenging if the minority classes overlap [11]. To minimize the impact of multi-class imbalance accompanied by overlapping, the preprocessing stage has a very significant effect [12]. Feature selection is often used at the preprocessing stage for this problem, so applying a preprocessing stage in the hybrid approach is a wise choice [13]. One hybrid approach method that applies preprocessing and gives satisfactory results on this problem is the Hybrid Approach Redefinition-Multiclass Imbalance (HAR-MI) [14].
JOIV: Int. J. Inform. Visualization, 5(1), March 2021, pp. 22-26
As with most hybrid approach methods, HAR-MI also over-samples the minority classes, using SMOTE in the feature selection process at the preprocessing stage. One feature selection method that provides excellent results in handling overlapping is Minimizing Overlapping Selection under SMOTE (MOSS) [15], although its over-sampling process often causes overfitting [16]. Other problems often found when applying SMOTE are overgeneralization and noise [17]. The use of Minority Over-Sampling Techniques (M-SMOTE) together with Edited Nearest Neighbor (ENN), a type of Hybrid Sampling, has yielded very satisfying results [18].
It would be valuable to have a method that handles multi-class imbalance accompanied by overlapping while also ensuring that the sampling process does not overfit. This study combines HAR-MI with Hybrid Sampling. Its results are compared with Neighbourhood-based under-sampling, one of the best methods for handling multi-class imbalance and overlapping [19].

II. MATERIAL AND METHOD
A. Hybrid Approach
The pseudocode of the Hybrid Approach is as follows [20].

B. Hybrid Sampling
The pseudocode of the Hybrid Sampling using M-SMOTE and ENN is as follows [18].
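The idea of hybrid sampling, over-sampling the minority class and then cleaning noisy samples with ENN, can be illustrated with a minimal sketch. This is not the pseudocode from [18]: the over-sampling step here is a plain SMOTE-style interpolation standing in for M-SMOTE, the ENN step removes samples from every class, and all function names and parameter defaults (`k`, `n_new`, `seed`) are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def k_nearest(point, candidates, k):
    """Return the k (features, label) candidates closest to `point`."""
    return sorted(candidates, key=lambda c: math.dist(point, c[0]))[:k]


def smote_like_oversample(samples, minority_label, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    minority sample and one of its k nearest minority neighbours.
    Requires at least two samples of the minority class."""
    rng = random.Random(seed)
    minority = [s for s in samples if s[1] == minority_label]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbour = rng.choice(k_nearest(base[0], others, k))
        gap = rng.random()  # position along the segment base -> neighbour
        point = tuple(b + gap * (n - b) for b, n in zip(base[0], neighbour[0]))
        synthetic.append((point, minority_label))
    return synthetic


def enn_clean(samples, k=3):
    """Edited Nearest Neighbours: drop every sample whose label disagrees
    with the most common label among its k nearest neighbours."""
    kept = []
    for i, sample in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        votes = Counter(label for _, label in k_nearest(sample[0], others, k))
        if votes.most_common(1)[0][0] == sample[1]:
            kept.append(sample)
    return kept


def hybrid_sample(samples, minority_label, n_new, k=3, seed=0):
    """Over-sample the minority class, then clean the result with ENN."""
    augmented = samples + smote_like_oversample(samples, minority_label, n_new, k, seed)
    return enn_clean(augmented, k)


# Toy data (made up): a majority cluster, a minority cluster, and one
# majority sample sitting inside the minority cluster as noise.
data = [
    ((0.0, 0.0), "maj"), ((0.2, 0.1), "maj"), ((0.1, 0.3), "maj"),
    ((0.3, 0.2), "maj"), ((0.0, 0.4), "maj"), ((0.4, 0.0), "maj"),
    ((5.0, 5.0), "min"), ((5.2, 5.1), "min"), ((5.1, 5.3), "min"),
    ((5.1, 5.0), "maj"),
]
balanced = hybrid_sample(data, "min", n_new=3, k=3, seed=1)
```

On this toy data the minority class grows from three to six samples, while the majority sample embedded in the minority cluster is removed by the ENN step. That is the combined effect the method aims for: more minority samples and fewer noisy majority samples.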

C. Augmented R-Value
Augmented R-Value states how much overlapping occurs. The greater the Augmented R-Value, the greater the overlapping [21].
In its definition, $C_0, C_1, \ldots, C_{k-1}$ are the $k$ class labels with $|C_0| \ge |C_1| \ge \cdots \ge |C_{k-1}|$, and $D[V]$ denotes dataset $D$ containing the predictors in set $V$. A larger $R_{aug}$ indicates a higher degree of overlap in the dataset.
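The defining equation is not reproduced in this text. A form commonly given for the augmented R-value, stated here as an assumption rather than as the exact formula of [21], is $R_{aug}(D[V]) = \sum_{i=0}^{k-1} |C_{k-1-i}| \, R(C_i[V]) \, / \, \sum_{i=0}^{k-1} |C_i|$, where $R(C_i)$ is the fraction of samples in class $C_i$ that have more than $\theta$ of their nearest neighbours in other classes. Under this form, the overlap of a small class is weighted by the size of a large class, so overlap in minority classes counts most. The sketch below follows that assumed form; the function names and the defaults for `n_neighbours` and `theta` are illustrative only.

```python
import math
from collections import Counter


def class_r_value(samples, label, n_neighbours=3, theta=1):
    """Fraction of samples of `label` that have more than `theta` of their
    n_neighbours nearest neighbours in other classes."""
    members = [s for s in samples if s[1] == label]
    overlapped = 0
    for sample in members:
        others = [o for o in samples if o is not sample]
        nearest = sorted(others, key=lambda o: math.dist(sample[0], o[0]))
        foreign = sum(1 for _, lbl in nearest[:n_neighbours] if lbl != label)
        if foreign > theta:
            overlapped += 1
    return overlapped / len(members)


def augmented_r_value(samples, n_neighbours=3, theta=1):
    """Size-weighted average of per-class R-values (assumed form)."""
    sizes = Counter(lbl for _, lbl in samples)
    # Labels ordered from the largest class (C_0) to the smallest (C_{k-1}).
    ordered = [lbl for lbl, _ in sizes.most_common()]
    k = len(ordered)
    weighted = sum(
        sizes[ordered[k - 1 - i]]
        * class_r_value(samples, ordered[i], n_neighbours, theta)
        for i in range(k)
    )
    return weighted / sum(sizes.values())


# Toy data (made up): one minority sample in the middle of four majority samples.
overlapping = [
    ((0.0, 0.0), "maj"), ((1.0, 0.0), "maj"),
    ((0.0, 1.0), "maj"), ((1.0, 1.0), "maj"),
    ((0.5, 0.5), "min"),
]
r_aug = augmented_r_value(overlapping)
```

In the toy data the single minority sample is surrounded by majority samples, so its class R-value is 1 and is weighted by the majority size of 4, while the majority class has an R-value of 0. The result is 4/5 = 0.8, which shows how strongly minority-class overlap drives the measure.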

D. Classifier Performance
Classifier performance was measured using the Matthews Correlation Coefficient (MCC), Precision, Recall, and F-Value. These measurements are based on the confusion matrix shown in Table I [22]. The MCC, Precision, Recall, and F-Value calculations can be seen in the following equations [18].
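The four measures can be computed from the confusion-matrix counts using their standard binary definitions, as in the sketch below. The function names are illustrative. How the per-class values are aggregated in the multi-class case (for example, one-vs-rest averaging) is not stated here, so the sketch covers a single positive class only.

```python
import math


def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_value(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient, in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up counts for illustration: TP=40, FP=10, FN=20, TN=30.
scores = {
    "precision": precision(40, 10),
    "recall": recall(40, 20),
    "f_value": f_value(40, 10, 20),
    "mcc": mcc(40, 10, 20, 30),
}
```

Unlike Precision, Recall, and F-Value, MCC uses all four cells of the confusion matrix, including the true negatives. This is why it is often preferred as a single summary score on imbalanced data.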

E. Proposed Method / Algorithm
The research stages can be seen in Fig. 1.

F. Preprocessing Using Minimizing Overlapping Selection under Hybrid Sampling (MOSHS)
The pseudocode of the preprocessing stage is as follows.

G. Processing Using Different Contribution Sampling (DCS)
The pseudocode of the processing stage is as follows.

III. RESULTS AND DISCUSSION
A. Dataset Description
The multi-class imbalanced datasets used in this study were sourced from the KEEL Repository [23]. The datasets used can be seen in Table II. Table II shows that the datasets have a range of imbalance ratios, from low through medium to high. The number of samples also varies.

B. Testing Result
The first test was conducted to obtain the Augmented R-Value and MCC values. The test results can be seen in Table III. Based on Table III, the Augmented R-Values obtained by HAR-MI with Hybrid Sampling are better than those obtained by Neighbourhood-based under-sampling; because a greater Augmented R-Value means greater overlapping, better here means lower. The Augmented R-Values obtained by the two methods also show that the greater the imbalance ratio, the greater the tendency for overlapping to occur. The MCC value provided by HAR-MI with Hybrid Sampling is also better than that obtained by Neighbourhood-based under-sampling. The second test was conducted to obtain Precision, Recall, and F-Value. The test results can be seen in Table IV.

C. Statistical Tests
To validate the results of the study, a statistical test was conducted to measure performance using the Wilcoxon Signed-Rank Test [24]. The statistical test results can be seen in Table V.
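The mechanics of the Wilcoxon Signed-Rank Test on paired per-dataset scores can be sketched as follows. This is a minimal exact two-sided version that assumes no zero differences and no tied absolute differences; statistical packages may instead use a normal approximation with tie corrections and can return slightly different P-Values. The function name and the scores are made up for illustration and are not the values of Table V.

```python
from itertools import product


def wilcoxon_signed_rank_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (statistic, p_value), where statistic = min(W+, W-).
    Assumes no zero differences and no ties among absolute differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    if any(d == 0 for d in diffs) or len({abs(d) for d in diffs}) != n:
        raise ValueError("zero or tied differences are not supported")

    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)

    # Under the null hypothesis each rank is positive or negative with
    # probability 1/2, so enumerate all 2^n sign assignments.
    at_most = 0
    for signs in product((0, 1), repeat=n):
        w = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        if w <= statistic:
            at_most += 1
    p_value = min(1.0, 2 * at_most / 2 ** n)
    return statistic, p_value


# Made-up scores of two methods on six datasets (method A always higher).
method_a = [0.91, 0.85, 0.78, 0.88, 0.69, 0.95]
method_b = [0.90, 0.82, 0.72, 0.86, 0.64, 0.91]
statistic, p_value = wilcoxon_signed_rank_exact(method_a, method_b)
```

With six pairs that all favour the same method, the statistic is 0 and the exact two-sided P-Value is 2/64 = 0.03125, the smallest value this test can return for six pairs. With few datasets, the test can therefore only just cross the 0.05 threshold.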

D. Discussion
Based on the test results and the statistical tests, the HAR-MI method with Hybrid Sampling gives better results in terms of overlapping than Neighbourhood-Based Under-sampling. In general, however, the results obtained in handling overlapping are good, with Augmented R-Values that are not too high. The Augmented R-Value is very dependent on the imbalance ratio: the higher the imbalance ratio, the greater the overlapping that occurs. Based on the statistical tests, there is a significant difference in Augmented R-Value and MCC between HAR-MI with Hybrid Sampling and Neighbourhood-Based Under-sampling.
As for the MCC value, the results given by HAR-MI with Hybrid Sampling are also better, and MCC tends to decrease as the number of classes increases. For Precision, Recall, and F-Value, HAR-MI with Hybrid Sampling is also better than Neighbourhood-Based Under-sampling. The results further show that the higher the imbalance ratio, the lower the Precision, Recall, and F-Value obtained.
Based on the Wilcoxon Signed-Rank Test, the P-Value is 0.0355223 for Augmented R-Value, 0.0355223 for MCC, 0.0312500 for Recall, and 0.0340064 for F-Value. This means that for Augmented R-Value, MCC, Recall, and F-Value there is a significant difference between the results of HAR-MI with Hybrid Sampling and those of Neighbourhood-Based Under-sampling. For Precision, although the HAR-MI results are better than those of Neighbourhood-Based Under-sampling, the Wilcoxon Signed-Rank Test shows no significant difference, since the P-Value obtained, 0.0625000, is greater than 0.05.

IV. CONCLUSION
Based on the results in Tables III, IV, and V, HAR-MI with Hybrid Sampling gives better results than Neighbourhood-Based Under-sampling in handling multi-class imbalance and overlapping. HAR-MI with Hybrid Sampling achieves better values on all measures tested (Augmented R-Value, MCC, Precision, Recall, and F-Value), although the difference in Precision is not statistically significant.
This shows that for handling multi-class imbalance, Hybrid Sampling, which can avoid over-fitting, also gives better results than under-sampling or over-sampling alone. Future research can address multi-class imbalance accompanied by overlapping in datasets with a high imbalance ratio, a large number of classes, and many attributes.