Improving Minority Class Recall through a Novel Cluster-Based Oversampling Technique

: In this study, we propose an approach to address the pressing issue of false negative errors by enhancing minority class recall within imbalanced data sets commonly encountered in machine learning applications. Through the utilization of a cluster-based oversampling technique in conjunction with an information entropy evaluation, our approach effectively targets areas of ambiguity inherent in the data set. An extensive evaluation across a diverse range of real-world data sets characterized by inter-cluster complexity demonstrates the superior performance of our method compared to that of existing oversampling techniques. Particularly noteworthy is its significant improvement within the Delinquency Telecom data set, where it achieves a remarkable increase of up to 30.54 percent in minority class recall compared to the original data set. This notable reduction in false negative errors underscores the importance of our methodology in accurately identifying and classifying instances from underrepresented classes, thereby enhancing model performance in imbalanced data scenarios.


Introduction
Imbalanced learning presents a significant challenge in machine learning when faced with data sets characterized by a substantial imbalance in class distribution, where one class predominates while others are relegated to minority status.Traditional machine learning algorithms, when applied to imbalanced data sets, often yield outputs with predictive biases that favor the majority class due to their larger volume.This frequently results in the development of suboptimal models [1].Effectively addressing classification challenges in this context necessitates adherence to the "majority rule" paradigm.This paradigm emphasizes the importance of having an extensive data set for proficient learning.It operates on the principle that models trained on imbalanced data sets may demonstrate a bias toward the majority class, potentially introducing biases into the outcomes.This imperative entails the generation of densely populated instances that accurately represent the essence of class concepts to ensure adequate representation in the relevant feature space, thereby facilitating the differentiation between various sample unit types.
To the best of our knowledge, the existing classification methods, while proficient in understanding minority class concepts within imbalanced data sets, have not prioritized improving the recall values for minority classes.In classification, it is crucial to acknowledge the reciprocal relationship between minority class recall and precision.Minority class recall signifies the model's ability to correctly identify instances of the minority classes, while precision measures the proportion of correctly classified positive predictions among all positive predictions made by the model.Boosting recall, especially for minority classes, enhances the model's capacity to capture more instances of these classes.However, this increase in recall may lead to lower precision, as the model may also classify more false positives.Therefore, achieving an optimal balance between recall and precision is a critical consideration in developing models for imbalanced data sets, and it is the focal point of this study.
The significance of this challenge resonates widely across diverse fields, illustrating its profound implications.Beyond mere quantitative metrics, the ramifications of minority class recall and false negatives translate into tangible outcomes, underscoring the imperative of finding effective solutions.Within the domain of computational science and practical application, our study scrutinizes their fundamental principles.For instance, in fraud detection-a critical endeavor where the detection of deceit is paramount-insufficient recall rates for the minority class, as explicated by Chandola et al. [2], may result in undetected anomalies and consequential financial losses.These implications extend beyond mere algorithmic frameworks, profoundly affecting the financial well-being of both individuals and organizations.Similarly, in medical diagnostics, the imperative of minimizing false negatives, as emphasized by Patel et al. [3], is of utmost importance.In this context, the accuracy of predictive models significantly influences patient outcomes, and instances of false negatives may lead to missed opportunities for timely intervention, thereby adversely impacting patient welfare and prognostic outcomes.
This challenge extends across the finance and healthcare domains, as evidenced by research such as Doe et al. [4].Additionally, it encompasses the classification of legal offenses, as demonstrated by Prexawanprasut et al. [5].In the latter, the focus was on predicting recidivism using offender information and attitude scores, categorizing individuals into class 0 for first-time offenders and 1 for recidivists.It is notable that conventional imbalanced models showed suboptimal performance in predicting class 1, with reduced precision and recall scores, leading to numerous false negative predictions, where individuals prone to recidivism were inaccurately classified as non-recidivists.
Our proposed algorithm's primary contribution lies in its strategic prioritization of augmenting minority class recall while maintaining specificity.This emphasis hinges upon specific parameter settings, the comprehensive establishment of which for real-world contexts is pivotal for ensuring robust performance.Our algorithm offers a novel approach that addresses the practical implications of imbalanced learning, transcending technical intricacies to align machine learning innovations with practical applications.Notably, by employing a targeted approach to generating synthetic samples within individual clusters, we mitigate the risk of introducing extraneous noise in inter-cluster regions, thereby enhancing the algorithm's effectiveness.Our contribution advocates for the adoption of this cluster-based oversampling technique as a strategic intervention to effectively address imbalanced learning challenges and bridge the gap between machine learning advancements and their real-world implementation.

Literature Review
In the realm of imbalanced data classification, this literature review focuses on three key categories of approach: conventional resampling methods, cost-sensitive learning methods, and ensemble methods.These categories encompass techniques that address challenges arising from skewed class distributions and disparate misclassification costs.Our selection of papers within each category aims to capture seminal and contemporary works, providing a comprehensive exploration of strategies to tackle imbalanced data sets.Notably, the chosen papers inspired and informed the approach proposed in our study, enhancing our novel solution.

Conventional Resampling Method
Various oversampling techniques have been recognized as valuable strategies for addressing imbalances within data sets.These methods were designed to alleviate class imbalances by introducing synthetic samples, thereby augmenting the representation of the minority class.This augmentation aimed to empower learning algorithms to more effectively capture crucial patterns and characteristics inherent in the minority class, resulting in improved predictive performance and heightened recall for the minority class.
The scholarly landscape has witnessed the proposition of numerous oversampling methods, among which SMOTE (Synthetic Minority Oversampling Technique), introduced by Chawla et al. [6], stands out as one of the pioneering techniques.SMOTE operates by generating synthetic samples through interpolation between neighboring instances of the minority class, adhering to a predefined formula.
Here, x ′ i represents the synthetic sample, x i is the selected instance, ρ is a random number between 0 and 1, and x j is the neighboring instance of x i .In Figure 1, the depiction illustrates the process employed by SMOTE for generating new samples.The initial image portrays a scenario where the majority class is represented by circles, and the minority class is denoted by stars, with both classes distinctly separated.Following this, in the subsequent phase, SMOTE synthesizes new samples by creating instances interpolated between the original samples, extending into the inter-cluster region, as depicted in the second picture.However, this approach introduces a potential drawback, as it may lead the classifier to overlook sub-concepts within the sub-minority group of the minority class.
minority class.This augmentation aimed to empower learning algorithms to more effectively capture crucial patterns and characteristics inherent in the minority class, resulting in improved predictive performance and heightened recall for the minority class.The scholarly landscape has witnessed the proposition of numerous oversampling methods, among which SMOTE (Synthetic Minority Oversampling Technique), introduced by Chawla et al. [6], stands out as one of the pioneering techniques.SMOTE operates by generating synthetic samples through interpolation between neighboring instances of the minority class, adhering to a predefined formula.

𝑥′ = 𝑥 + 𝜌(𝑥 − 𝑥 )
Here, ′ represents the synthetic sample,  is the selected instance,  is a random number between 0 and 1, and  is the neighboring instance of  .In Figure 1, the depiction illustrates the process employed by SMOTE for generating new samples.The initial image portrays a scenario where the majority class is represented by circles, and the minority class is denoted by stars, with both classes distinctly separated.Following this, in the subsequent phase, SMOTE synthesizes new samples by creating instances interpolated between the original samples, extending into the inter-cluster region, as depicted in the second picture.However, this approach introduces a potential drawback, as it may lead the classifier to overlook sub-concepts within the sub-minority group of the minority class.SMOTE, while demonstrating promising outcomes in enhancing minority class recall, exhibits shortcomings in generating noisy samples and oversimplifying decision boundaries in specific scenarios [7].In response to these limitations, a technique known as the Support Vector Machine Synthetic Minority Oversampling Technique (SVMS-MOTE) was introduced, amalgamating the principles of SMOTE with Support Vector Machines (SVMs) [8].SVMSMOTE employed SVM to discern samples with the highest discriminative value, generating synthetic samples proximate to the authentic decision boundary.This approach notably augmented the diversity and quality of the synthetic samples [9].Empirical evidence supported the superior performance of SVMSMOTE, especially in scenarios characterized by complex and overlapping class distributions.Various authors have proposed optimized strategies for leveraging SVM-SMOTE to enhance the classification performance for imbalanced data sets.For example, Gao et al. [10] augmented the effectiveness of the SVM and SMOTE combination, potentially introducing novel optimization techniques tailored to this specific problem.In a distinct approach, Xie et al. 
[11] elevated the classification performance by adapting SVM through a sub-sampling technique.Conversely, Han et al. [12] employed a strategy wherein they defined informative instances to aid in outlining the region for synthesizing new samples by drawing SVM hyperplanes, while García et al. [13] found SVM assessment particularly valuable, utilizing it to define credit risk and, thereby, contributing significantly to this domain.
Another noteworthy oversampling method, proposed by Han et al. [14], is Border-lineSMOTE.BorderlineSMOTE specifically focuses on synthesizing samples in proximity to the decision boundary, drawing from borderline instances that are more susceptible to misclassification.By prioritizing the generation of samples from these pivotal instances, SMOTE, while demonstrating promising outcomes in enhancing minority class recall, exhibits shortcomings in generating noisy samples and oversimplifying decision boundaries in specific scenarios [7].In response to these limitations, a technique known as the Support Vector Machine Synthetic Minority Oversampling Technique (SVMSMOTE) was introduced, amalgamating the principles of SMOTE with Support Vector Machines (SVMs) [8].SVMSMOTE employed SVM to discern samples with the highest discriminative value, generating synthetic samples proximate to the authentic decision boundary.This approach notably augmented the diversity and quality of the synthetic samples [9].Empirical evidence supported the superior performance of SVMSMOTE, especially in scenarios characterized by complex and overlapping class distributions.Various authors have proposed optimized strategies for leveraging SVM-SMOTE to enhance the classification performance for imbalanced data sets.For example, Gao et al. [10] augmented the effectiveness of the SVM and SMOTE combination, potentially introducing novel optimization techniques tailored to this specific problem.In a distinct approach, Xie et al. [11] elevated the classification performance by adapting SVM through a sub-sampling technique.Conversely, Han et al. [12] employed a strategy wherein they defined informative instances to aid in outlining the region for synthesizing new samples by drawing SVM hyperplanes, while García et al. [13] found SVM assessment particularly valuable, utilizing it to define credit risk and, thereby, contributing significantly to this domain.
Another noteworthy oversampling method, proposed by Han et al. [14], is Borderli-neSMOTE.BorderlineSMOTE specifically focuses on synthesizing samples in proximity to the decision boundary, drawing from borderline instances that are more susceptible to misclassification.By prioritizing the generation of samples from these pivotal instances, BorderlineSMOTE enhances the generalization and predictive performance of minority class recall [15].It has demonstrated effectiveness in managing data sets characterized by overlapping classes and varying densities [16].
In addition to these methodologies, various other oversampling techniques have been introduced.ADASYN (Adaptive Synthetic Sampling) dynamically adjusts the density distribution of the minority class during synthetic sample generation, with a specific emphasis on challenging instances [17].Conversely, MWMOTE (Majority-Weighted Minority Oversampling Technique) [18] assigns weights to majority class instances based on their proximity to minority class instances, ensuring a balanced contribution in generating synthetic samples.Ensemble techniques, such as SMOTEBoost [19], RUSBoost [20], and BalanceCascade [21], amalgamate multiple oversampling strategies and may incorporate under-sampling to produce a diverse array of synthetic samples.These ensemble methods are designed to enhance the overall minority class recall performance and effectively address the challenges posed by imbalanced data sets.

Extreme Learning Machine and Cost-Sensitive Learning Method
Cost-sensitive learning, as manifested in the domain of machine learning, was devised to consider the expenses associated with misclassification and can be divided into two categories: techniques sensitive to costs indirectly and methods sensitive to costs directly.The direct method entailed the construction of a cost-sensitive learning algorithm by integrating diverse misclassification costs into the learning process.These methods necessitated the adaptation of traditional machine learning algorithms to accommodate the costs linked with various types of misclassifications.Common techniques within this category included adjusting class weights [7], utilizing alternative evaluation metrics such as precision recall or F1-score [22,23], and fine-tuning thresholds [24,25].These approaches indirectly addressed costs by altering the training and evaluation processes of the model.In contrast, the indirect cost-sensitive method constructed a cost-sensitive classifier either by preprocessing the training data through a set of rules or by subjecting the training data to a predefined set of rules.
The efficacy of the extreme learning machine (ELM) became apparent through its rapid learning and robust generalization capabilities in training single hidden-layer feedforward neural networks.However, the pervasive challenge of imbalanced classification across diverse fields significantly undermined classifier performance.In response to this concern, a novel approach, known as Parallel One-Class ELM (P-ELM) [26], was introduced, grounded in Bayesian methodology.Within the framework of P-ELM, the training data set underwent segmentation into k components, aligned with specific class affiliations.Subsequently, these segmented data sets were directed to individual kernel-based One-Class ELM classifiers.By employing probability density estimation based on the output function of these classifiers, the proposed method facilitated the direct determination of sample class assignments using Bayesian analysis.The efficacy of the P-ELM approach was comprehensively evaluated against alternative class imbalance learning methods, encompassing a range of benchmark data sets, spanning binary and multiclass classification contexts.Additionally, the application of P-ELM in a real-world scenario, specifically the diagnosis of blast furnace status, was considered.Empirical results convincingly underscored the pronounced effectiveness of the P-ELM methodology.
In the context of addressing imbalanced classification, the study introduced an innovative extension to the extreme learning machine (ELM) framework through a class-specific cost-sensitive mechanism [27].This novel approach adapted misclassification costs according to the importance of each class, effectively mitigating the challenges posed by imbalanced data sets.By augmenting classification performance, the methodology retained relevance for researchers aiming to devise solutions for imbalanced classification problems, employing ELMs in conjunction with customized cost-sensitive strategies.Unlike conventional cost-sensitive methods, which often required intricate parameter tuning and might have suffered from scalability issues, ELM distinguished itself due to its efficient and rapid training process, enabling it to handle large-scale imbalanced data sets more effectively.

Ensemble Method
Ensemble methods are a popular technique for improving the performance of machine learning models on imbalanced data sets.Ensemble methods combine multiple models to improve the accuracy and robustness of predictions.There are many variations of ensemble methods, such as Random Forest, AdaBoost, and XGBoost, which have been applied to imbalanced data sets with varying degrees of success.In general, ensemble methods can be effective for improving the performance of machine learning models on imbalanced data sets, but their effectiveness depends on the specific problem and data set at hand.
Random Forest [28] is a popular machine learning algorithm used for classification, regression, and other tasks.It is an ensemble learning method that combines multiple decision trees to make predictions.In a Random Forest, a large number of decision trees are created, each based on a random subset of the features in the training data.These decision trees are then combined to form a forest, where each tree's output is weighted equally in the final prediction.During training, the Random Forest algorithm creates these decision trees using a process called bootstrap aggregating, or "bagging", where each tree is trained on a randomly sampled subset of the training data.Random Forests are robust against overfitting because the individual trees are built on different subsets of the features and the training data.This reduces the chance that any one decision tree will overfit the training data and make inaccurate predictions on new, unseen data.Additionally, Random Forests can handle a large number of input features, including both categorical and continuous data.
AdaBoost [29] and XGBoost [30] have been prominent ensemble learning algorithms, widely utilized for classification and regression tasks, each characterized by distinctive approaches.AdaBoost, known as "Adaptive Boosting", constructs a robust classifier by amalgamating multiple weak classifiers, each trained on a subset of the training data, with weights assigned based on accuracy.It iteratively incorporates new weak classifiers, prioritizing misclassified data, until reaching a specified iteration count or a predetermined accuracy threshold.Conversely, XGBoost, or "eXtreme Gradient Boosting", employs decision trees as base learners and introduces a unique approach to their weighting and regularization.In XGBoost, each decision tree is trained to minimize a regularized objective function, balancing accuracy and complexity, while "gradient boosting" adjusts weights based on the gradient of the loss function.Acknowledged for scalability and speed, particularly with large data sets and high-dimensional feature spaces, XGBoost encompasses advanced features like parallel processing and GPU acceleration, rendering it well-suited for intricate machine learning tasks.The choice between AdaBoost and XGBoost is contingent on specific problem requirements, considering trade-offs among accuracy, speed, and interpretability.
In addition to Deep Ensemble, various other ensemble methods have been explored to address challenges in improving the accuracy and robustness of machine learning models.Shao-Hua Sun et al. [31] introduced a novel ensemble method, termed Deep Ensemble, which involved the combination of multiple deep neural networks trained on distinct random initializations.The paper centered on the analysis of the loss landscape of deep neural networks, demonstrating that Deep Ensemble had the capacity to enhance both accuracy and robustness in final predictions.A new approach for combining predictions from individual networks within the ensemble was proposed, involving a weighted combination of softmax outputs, with weights determined based on the geometric median of predictions.The geometric median, recognized as a robust estimator less sensitive to outliers than the arithmetic mean, was shown to improve both the accuracy and robustness of final predictions.

Cluster-Based Algorithm
Several inventive methods have been suggested to address the issues of class imbalance and skewed classification, utilizing cluster-based algorithms.These algorithms employ clustering techniques to handle imbalanced data sets by grouping similar data points together based on specific criteria.One such method involves decomposing the problem to reduce sub-problem complexity by strategically minimizing the number of probability distributions within them.By using a clustering evaluation algorithm, optimal sub-problem numbers are determined, and weighted kernelized extreme learning machines (WKELMs) [32] are employed for classifier creation, resulting in improved predictive accuracy.Comparative assessments against existing methods show the superior performance of this ensemble technique, highlighting its potential to effectively address imbalanced classification issues.KNSMOTE [33] is a conceptual algorithm that combines SMOTE and k-means to handle imbalanced medical data classification.It identifies "safe samples" and generates synthetic samples through linear interpolation, adjusting oversampling ratios based on data set imbalance.This approach yields notable improvements in sensitivity and specificity indexes when applied to medical data sets with the Random Forest algorithm.
The LR-SMOTE algorithm [34] aims to alleviate imbalanced data set challenges encountered in machine learning classification, particularly in the loose particle detection of sealed electronic components.By enhancing the traditional SMOTE algorithm, LR-SMOTE generates samples closer to the sample center, aiming to mitigate outlier samples or altered data set distributions.Experimental validation demonstrates its superior performance over SMOTE in various metrics, like G-means, F-measure, and AUC.
Leveraging density peak clustering strengths, the adaptive weighted oversampling method [35] introduces a novel approach to address classification challenges.Unlike conventional methods, density peak clustering accurately identifies sub-clusters within minority instances with varying sizes and densities.This technique effectively handles between-class and within-class imbalance issues.The adaptive determination of sub-cluster sizes and oversampling probabilities ensures the generation of synthetic minority instances tailored to data set characteristics.Moreover, a heuristic filtering strategy prevents overlap, enhancing the method's robustness.
In the pattern recognition and data mining domain, Guzmán-Ponce et al. [36] tackled class overlap and imbalance challenges.Their approach combines a two-stage undersampling technique using DBSCAN clustering and a minimum spanning tree algorithm.By simultaneously addressing both complexities, the algorithm enhances classifier efficacy, as demonstrated in extensive experimental evaluations.
Class-imbalanced data sets pose challenges in various domains like health and security.A novel hybrid approach [37] reduces majority class dominance through class decomposition and increases minority class instances using oversampling.Unlike traditional methods, this approach preserves majority class instances while significantly reducing dominance, leading to a more balanced data set and improved results.Extensive experiments validate the effectiveness of the proposed methods, contributing to the advancement of class-imbalanced data set handling techniques across diverse domains.

Proposed Algorithm
Our proposed oversampling algorithm, named ClusterOversampleG-Mean (COG), integrates several crucial components aimed at improving minority class recall in imbalanced data sets.It combines clustering and resampling techniques, customizing treatment for individual clusters while minimizing the synthesis of instances across clusters.Additionally, it determines specific majority and minority classes within each cluster.The optimization process prioritizes maximizing the G-mean as it aligns with our objective of enhancing minority class recall, primarily by minimizing false negatives without direct consideration for true positives.Consequently, the F1-score is not utilized in this context as it does not solely focus on reductions in false negatives.The F1-score, being the harmonic mean of precision and recall, balances both measures.However, in situations where there is a significant class imbalance and the focus is primarily on recall (as in the case of the minority class), the F1-score may not fully capture the trade-off between precision and recall.Unlike methods that rely on kernel functions or cost-sensitive learning, COG employs a clustering approach to facilitate oversampling.This is because the sparsity of minority classes often hampers SVM hyperplanes from accurately identifying their regions.Adjusting kernel functions at this stage may not effectively guide sample synthesis to the appropriate areas.
By breaking down the minority class problem into clustered instances, COG simplifies the SVM kernel function, enabling precise resampling where it is needed most.This approach allows classifiers to effectively capture minority classes, resulting in an overall increase in recall values.
The COG algorithm operates on the premise that imbalance issues manifest differently across various data clusters, aiming to tackle these challenges through a two-step approach.Initially, the algorithm clusters the imbalanced data set to identify distinct regions, where each cluster may encompass both majority and minority class members, reflecting the localized nature of imbalance.Subsequently, oversampling techniques are applied independently to each cluster, serving to minimize the creation of new samples in inter-cluster regions while allowing each cluster to define its own majority and minority classes.This localized oversampling fosters the emergence of new concepts within each cluster, providing the classifier with a clearer understanding of class concepts.The oversampling process iterates until the algorithm yields the highest G-mean value, thereby optimizing classifier performance in imbalanced data sets by addressing varied manifestations of imbalance across different data clusters.
In contrast to traditional oversampling techniques that uniformly generate synthetic samples throughout the data set, the COG algorithm stands out by initially partitioning samples into clusters prior to resampling.This methodology prevents the creation of new samples within inter-cluster regions, thereby improving the discernibility of subconcepts within each cluster and aiding the final classifier in identifying sample units more effectively within sub-regions.By focusing on synthesizing samples within individual clusters, COG significantly mitigates the risk of introducing irrelevant noise across intercluster regions, resulting in a clearer understanding of the distinctive class concepts inherent in each cluster.Furthermore, by allowing each cluster to autonomously define its majority and minority class boundaries, COG adapts to the unique characteristics of each data cluster, surpassing conventional cost-sensitive learning methods that apply a uniform global strategy.This tailored approach enhances the algorithm's effectiveness in handling intricate class imbalance scenarios, leading to improved model performance, characterized by substantially heightened overall recall values.Figure 2 illustrates the conceptualization of the proposed algorithm.
there is a significant class imbalance and the focus is primarily on recall (as in the case of the minority class), the F1-score may not fully capture the trade-off between precision and recall.Unlike methods that rely on kernel functions or cost-sensitive learning, COG employs a clustering approach to facilitate oversampling.This is because the sparsity of minority classes often hampers SVM hyperplanes from accurately identifying their regions.Adjusting kernel functions at this stage may not effectively guide sample synthesis to the appropriate areas.By breaking down the minority class problem into clustered instances, COG simplifies the SVM kernel function, enabling precise resampling where it is needed most.This approach allows classifiers to effectively capture minority classes, resulting in an overall increase in recall values.
The COG algorithm operates on the premise that imbalance issues manifest differently across various data clusters, aiming to tackle these challenges through a two-step approach.Initially, the algorithm clusters the imbalanced data set to identify distinct regions, where each cluster may encompass both majority and minority class members, reflecting the localized nature of imbalance.Subsequently, oversampling techniques are applied independently to each cluster, serving to minimize the creation of new samples in inter-cluster regions while allowing each cluster to define its own majority and minority classes.This localized oversampling fosters the emergence of new concepts within each cluster, providing the classifier with a clearer understanding of class concepts.The oversampling process iterates until the algorithm yields the highest G-mean value, thereby optimizing classifier performance in imbalanced data sets by addressing varied manifestations of imbalance across different data clusters.
In contrast to traditional oversampling techniques that uniformly generate synthetic samples throughout the data set, the COG algorithm stands out by initially partitioning samples into clusters prior to resampling.This methodology prevents the creation of new samples within inter-cluster regions, thereby improving the discernibility of sub-concepts within each cluster and aiding the final classifier in identifying sample units more effectively within sub-regions.By focusing on synthesizing samples within individual clusters, COG significantly mitigates the risk of introducing irrelevant noise across inter-cluster regions, resulting in a clearer understanding of the distinctive class concepts inherent in each cluster.Furthermore, by allowing each cluster to autonomously define its majority and minority class boundaries, COG adapts to the unique characteristics of each data cluster, surpassing conventional cost-sensitive learning methods that apply a uniform global strategy.This tailored approach enhances the algorithm's effectiveness in handling intricate class imbalance scenarios, leading to improved model performance, characterized by substantially heightened overall recall values.Figure 2 illustrates the conceptualization of the proposed algorithm.The COG algorithm starts by dividing the data set, D, into a training set, S, and a test set, T. 
It then creates the first model, M1, using the training set and calculates the G-mean on the test set, which is called the initial G-mean. The algorithm then divides the data into clusters to study the data concepts in each sub-group. For each cluster, S[i], the algorithm calculates the internal imbalance ratio and performs SMOTE with a predefined oversampling ratio, Π. It creates a new classifier from the synthesized samples and evaluates its performance by comparing the new G-mean with the initial G-mean. If the performance improves, the oversampling instance is recorded, and the initial G-mean is updated for the next round. The algorithm then increases the oversampling instances and repeats each step. Finally, it creates the classifier, M3, using the data set with the highest G-mean, which serves as the final decision model. The iterative nature of the COG algorithm inherently embodies a form of self-parameter tuning: through its iterative process, the algorithm continually refines its parameters to optimize performance, specifically aiming to maximize the G-mean in each iteration. The proposed algorithm is shown below (Algorithm 1).
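The loop just described can be sketched in a self-contained toy form. The stand-ins below are assumptions for illustration only: clusters are supplied pre-assigned rather than produced by a clustering step, SMOTE is replaced by random interpolation between same-cluster minority pairs, and the final classifier is a 1-nearest-neighbour rule; the paper's actual implementation relies on scikit-learn and imbalanced-learn.

```python
import math
import random

def g_mean(y_true, y_pred):
    # geometric mean of minority recall (sensitivity) and specificity
    tp = sum(t == 1 == p for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 == p for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    rec = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(rec * spec)

def nn_predict(train, X_test):
    # 1-NN stand-in for the final decision model
    return [min(train, key=lambda s: math.dist(s[0], x))[1] for x in X_test]

def oversample(cluster, ratio, rng):
    # random interpolation between minority pairs (SMOTE stand-in)
    minority = [s for s in cluster if s[1] == 1]
    if len(minority) < 2:
        return []
    synth = []
    for _ in range(int(ratio * len(minority))):
        (a, _), (b, _) = rng.sample(minority, 2)
        lam = rng.random()
        synth.append((tuple(ai + lam * (bi - ai) for ai, bi in zip(a, b)), 1))
    return synth

def cog(clusters, X_test, y_test, ratios=(0.5, 1.0, 2.0), seed=0):
    rng = random.Random(seed)
    train = [s for c in clusters for s in c]
    best_g = g_mean(y_test, nn_predict(train, X_test))   # initial G-mean
    for ratio in ratios:                 # increasing oversampling amounts
        for cluster in clusters:         # treat each cluster locally
            synth = oversample(cluster, ratio, rng)
            g = g_mean(y_test, nn_predict(train + synth, X_test))
            if g > best_g:               # record only improving instances
                train, best_g = train + synth, g
    return train, best_g

# toy data: two pre-assigned clusters, each with its own local imbalance
clusters = [
    [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0),
     ((0.5, 0.5), 1), ((0.4, 0.6), 1)],
    [((5.0, 5.0), 0), ((5.0, 6.0), 0), ((5.5, 5.5), 1)],
]
X_test = [(0.5, 0.4), (5.0, 5.1), (0.1, 0.1), (5.9, 5.9)]
y_test = [1, 0, 0, 1]
train, best = cog(clusters, X_test, y_test)
```

The key point of the sketch is the control flow, not the components: oversampling is attempted cluster by cluster, and a candidate is kept only when it raises the G-mean over the running best.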

Experimental Configuration
This section describes the experimental setting: the preparation of the data sets, the experimental setup, and the performance measures.

Data Collection and Preprocessing
A diverse set of ten real-world data sets was employed to systematically evaluate the outcomes of this study. This approach aimed to comprehensively assess the effectiveness of the proposed strategy in enhancing the recall value across various domains. Each data set was meticulously selected to represent a distinct domain and set of challenges. For instance, while the Juvenile Delinquency data set provided insights into the factors associated with future criminal activity among juvenile offenders, its significance was balanced with other data sets in the analysis. Similarly, the Delinquency Telecom and Lending Club data sets, sourced from Kaggle, offered valuable insights into telecommunications users' payment practices and loan issuance patterns, respectively. However, the emphasis on these data sets was proportionate to their role in the broader evaluation context. Furthermore, a suite of additional data sets, including Credit Fraud, Bank Marketing, Happino, US Crime, Ecoli, Optical, and Yeast, were curated and processed with equal importance to ensure a comprehensive assessment of the proposed strategy's performance and generalizability across diverse real-world scenarios. Each data set presented unique challenges, ranging from imbalanced class distributions to complex feature sets, requiring tailored preprocessing approaches. By rigorously testing our algorithm across these varied data sets, we aimed to provide a holistic evaluation of its effectiveness and suitability across different real-world applications. Before applying our algorithm to these data sets, we preprocessed them to ensure data quality and standardization. The resulting data sets, with their distinct characteristics, are listed in Table 1.

Analysis Software and Hardware
In the course of this investigation, classification models were developed using authentic training and test data sets within the Windows 10 operating system, employing an Intel Core i7 7500 CPU with 8 GB of RAM. The programming environment, consisting of Python version 3.6 and relevant libraries, facilitated the implementation of Random Forest (RF), decision tree (DT), and ID3 models. Addressing imbalanced data sets and detecting outliers were accomplished through the use of Imbalanced-learn version 0.4.3 and PyOD version 0.7.4, respectively. Additionally, the Scikit-learn library version 0.20.0 supported the implementation of algorithms based on distinct criteria: Gain Ratio (GR), Information Gain (IG), and GINI index (GI). Table 2 presents the hyper-parameter settings for the classification models developed as part of the investigation. These hyper-parameter configurations are used as control variables to validate algorithm performance, ensuring the accuracy and reliability of the experimental results.

Selection of Baseline Methods
Our choice of baseline methods is motivated by several key factors, including their widespread adoption in the field of imbalanced data classification and their relevance to the specific challenges our algorithm aims to address. Additionally, we focus specifically on oversampling methods, excluding cost-sensitive learning, extreme learning machines, and ensemble techniques. Cost-sensitive learning techniques adjust the misclassification costs for different classes to address class imbalance but may not directly address the issue through the oversampling of minority class instances. Extreme learning machines, while effective for various classification tasks, are not specifically designed to handle imbalanced data sets or address the challenges posed by minority class instances in the border regions. Ensemble techniques combine multiple base learners to improve classification performance but may not primarily focus on oversampling the minority class instances. Furthermore, our selection criteria prioritize methods that aim to address the challenges inherent in the border areas between minority and majority classes. The selection of SMOTE, SVMSMOTE, and BorderlineSMOTE as baseline methods for comparison is guided by their prominence in the literature, relevance to the challenges of imbalanced classification, and alignment with the specific objectives of our proposed algorithm. By comparing against these established methods, we aim to provide a comprehensive evaluation of the efficacy and performance of our cluster-based oversampling approach in addressing the identified shortcomings and improving decision-making in the border regions of minority and majority classes.

Performance Metric
To evaluate the performance of our proposed algorithm, we conducted extensive experiments using a set of benchmark data sets commonly used in imbalanced learning research. These data sets cover a range of domains and exhibit varying degrees of class imbalance. We compared the performance of our novel oversampling algorithm with state-of-the-art methods, including SMOTE, SVMSMOTE, and BorderlineSMOTE, using recall, specificity, G-mean, and F1-score as performance metrics, all of which rely on the confusion matrix in Table 3. A true positive (TP) or true negative (TN) is an instance that the algorithm classified correctly, whereas a false positive (FP) or false negative (FN) is an instance that it classified incorrectly. Because the problem is imbalanced, no definitive judgment about model performance can be drawn from any single value in isolation.
Precision and recall are two measures used to assess the performance of classification and information retrieval systems. Precision is defined as the proportion of relevant instances among all instances retrieved. Recall, also called "sensitivity", is the proportion of all relevant instances that are actually retrieved. When calibrating a model, it is frequently possible to increase precision at the expense of recall, or vice versa.

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

The geometric mean, often known as the G-mean, is the root of the product of the class-wise sensitivities. This metric aims to optimize the accuracy of each class while preserving the equilibrium between their respective accuracies. For binary classification, the G-mean is computed as the square root of the product of the recall and specificity. In multi-class settings, it is the higher-order root of the product of each class's sensitivity. By convention, the G-mean equals 0 if the classifier fails to recognize at least one of the classes.
The F1-score is a measure of a model's accuracy that considers both precision and recall. It provides a single score that balances the trade-off between precision and recall. It is particularly useful when dealing with imbalanced data sets, where one class may dominate the other. Both the F1-score and G-mean capture the trade-off between type I and type II errors, with the F1-score focusing on precision and recall, while the G-mean focuses on sensitivity and specificity.
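The metrics above can be computed directly from the four confusion-matrix counts; the counts used in the example below are hypothetical and chosen only for illustration.

```python
import math

def confusion_metrics(tp, fp, tn, fn):
    """Compute precision, recall, specificity, G-mean, and F1 from counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity / minority class recall
    specificity = tn / (tn + fp)
    g_mean = math.sqrt(recall * specificity)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, specificity, g_mean, f1

# hypothetical counts for illustration
p, r, s, g, f1 = confusion_metrics(tp=40, fp=10, tn=90, fn=20)
```

Note how a high specificity (0.9 here) can coexist with a much lower recall, which is exactly why no single value suffices on an imbalanced problem.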
Moreover, in our ongoing efforts to refine and understand the behavior of our algorithm, we employ information entropy as a crucial metric to scrutinize and control the ambiguity within the overlapping area of synthetic data sets. Information entropy quantifies the extent of overlap between distinct classes, especially in scenarios where decision boundaries are intricate and instances exhibit shared characteristics. Utilizing the information entropy confirms the behavior of our algorithm in the presence of ambiguous regions, particularly within synthetic data sets. This strategic use of information entropy enhances our ability to systematically assess the algorithm's performance, providing valuable insights into its adaptability and decision-making processes, especially when faced with challenging distinctions and ambiguous feature spaces. The information entropy H(X) of a discrete random variable X with possible outcomes x_1, x_2, ..., x_n and probability mass function P(X) is given by

H(X) = −∑_{i=1}^{n} P(x_i) log P(x_i),

where P(x_i) is the probability of outcome x_i and the sum is taken over all possible outcomes. The logarithm is usually taken with base 2, in which case the entropy is measured in bits; if the base is e (natural logarithm), the entropy is measured in nats. When comparing the information entropy values before and after applying the proposed algorithm, a reduction in information entropy indicates a reduction in uncertainty and, consequently, a decrease in ambiguity in the identified area.
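The entropy formula is a one-liner in practice; a small sketch, with the convention that 0 · log 0 = 0:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(X) = -sum p_i log p_i (0 log 0 treated as 0)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# a 60/40 two-class split in an ambiguous region
h = entropy([0.6, 0.4])
```

A 50/50 split gives the maximum of 1 bit, a pure region gives 0 bits, and the 60/40 split above sits in between (about 0.971 bits).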
The term "Misclassified Minority Instances" denotes instances within the minority class that are situated in ambiguous regions and are erroneously classified by baseline classifiers, specifically a Support Vector Machine (SVM). These instances encapsulate scenarios wherein classifiers encounter challenges in accurately discerning minority class samples. To identify these misclassified minority instances, we first employ an SVM to find the minority class instances that are misclassified, and then use K Nearest Neighbors (KNN) to identify the majority class instances surrounding these informative minority instances, thereby isolating them as examples of ambiguous classification. The metric "Percent of Misclassified Minority Instances" for each data set is the proportion of these intricate instances within the minority class, providing insight into the classification challenges faced by the models. Simultaneously, "Information Entropy" serves as a quantitative measure to characterize the degree of ambiguity inherent in the data sets. Information entropy is calculated specifically from the ambiguous area defined by KNN, offering a robust assessment of the uncertainty present in the data. Figure 3 illustrates the steps to identify instances for calculating information entropy.
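This identification procedure can be sketched in plain Python. The sketch assumes the baseline classifier's predictions are already available (standing in for the SVM step) and uses a brute-force Euclidean neighbour search in place of a KNN library; the toy data at the bottom is hypothetical.

```python
import math

def ambiguous_region_entropy(X, y, y_pred, k=3, minority=1):
    """Collect minority instances the baseline classifier got wrong,
    gather each one's k nearest majority neighbours, and return the
    binary entropy of the resulting ambiguous region."""
    missed = [i for i, (t, p) in enumerate(zip(y, y_pred))
              if t == minority and p != minority]
    majority_idx = [i for i, t in enumerate(y) if t != minority]

    region = set(missed)
    for i in missed:
        nearest = sorted(majority_idx,
                         key=lambda j: math.dist(X[i], X[j]))[:k]
        region.update(nearest)

    n = len(region)
    p_min = sum(1 for i in region if y[i] == minority) / n
    p_maj = 1.0 - p_min
    h = -sum(p * math.log2(p) for p in (p_min, p_maj) if p > 0)
    return h, p_min

# toy data: one minority point (label 1) misclassified by the baseline
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (0.5, 0.5)]
y = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
h, p_min = ambiguous_region_entropy(X, y, y_pred, k=2)
```

With one misclassified minority point and its two majority neighbours, the region is one-third minority, so its entropy is below the 1-bit maximum but well above zero, reflecting genuine overlap.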
To calculate the information entropy, we first determine the proportion of misclassified minority instances, P_minority, and the proportion of correctly classified majority instances, P_majority, within the data set. These proportions represent the relative frequencies of each class within the ambiguous region. Using these proportions, we apply the entropy formula:

Information Entropy = −(P_minority × log2 P_minority + P_majority × log2 P_majority)

This formula calculates the entropy as the weighted sum of the logarithms of the class proportions. The resulting entropy value measures the uncertainty or randomness associated with the classification of instances within the ambiguous region. A lower entropy value indicates higher certainty and a more distinct separation between the classes, while a higher entropy value suggests greater ambiguity or overlap between the classes.

Improvement in Minority Class Recall
The COG algorithm underwent testing using ten real-world data sets to assess the efficacy of the study's outcomes. The first three data sets were examined in detail because they are representative of the overall findings. Significant improvements in minority class recall were detected within the Delinquency Telecom data set, accompanied by a notable increase in the G-mean and a consistent F1-score. These findings strongly suggest a reduction in false negative predictions, aligning with the primary objective of the algorithm under study. There was a noteworthy 29.01 percent increase in the recall rate compared to the original data set, resulting in an impressive recall rate of 59.95 percent. This enhancement was realized through data segmentation into four clusters, coupled with resampling applied to clusters 0 and 1, employing a resampling ratio of 1.00. The increase in recall had only slight and manageable effects on the other performance metrics, maintaining overall stability, as illustrated in Table 4. Statistical analysis was conducted to compare the performance metrics (minority class recall, specificity, G-mean, F1-score) of the proposed method (COG) with other methods (original IR, SMOTE, SVMSMOTE, BorderlineSMOTE) using ANOVA followed by Tukey's post hoc test. The results revealed a significant difference in minority class recall among the methods (F = 10.23, p < 0.05). Tukey's post hoc test further indicated that the proposed method (COG) demonstrated significantly higher performance in minority class recall compared to the other methods (p < 0.05), while no significant differences were observed in the other performance metrics (specificity, G-mean, F1-score).
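In practice such comparisons are typically run with SciPy's `f_oneway` and a Tukey HSD implementation, but the F statistic behind the ANOVA step can itself be sketched in plain Python; the per-method recall values below are hypothetical, not the paper's data.

```python
def f_oneway(*groups):
    """One-way ANOVA F statistic (pure Python, for illustration)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # between-group variability, weighted by group size
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # within-group variability around each group mean
    ss_within = sum(sum((x - m) ** 2 for x in g)
                    for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# hypothetical per-fold minority recall for three methods
f_stat = f_oneway([0.60, 0.58, 0.62], [0.45, 0.47, 0.44], [0.50, 0.52, 0.49])
```

A large F indicates that between-method differences dominate within-method variation, which is then followed up with a post hoc test such as Tukey's to locate which pairs differ.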
The COG algorithm yielded a substantial enhancement in minority class recall within the Juvenile Delinquency data set, achieving an 80.56 percent recall rate. This marks a significant increase of 37.35 percent over the pre-oversampling recall rate. While COG achieved the highest recall rate for the minority class, the other metrics showed only slightly higher values compared to baseline oversampling techniques. Meanwhile, findings from the Lending Club data set suggest that COG could enhance the minority class recall by 78.58 percent. Tukey's post hoc test further identified that COG outperformed other methods in terms of minority class recall (p < 0.05) in the Juvenile Delinquency and Lending Club data sets, following ANOVA tests with F-values of 10.23 and 9.75, respectively. The experimental results for both data sets are detailed in Tables 5 and 6. The COG algorithm was also put to the test on additional open data sets, with mixed results. It significantly improved minority class recall in the Ecoli and Yeast data sets, outperforming other oversampling algorithms by a substantial margin. However, in the Credit Fraud data set, it lost out to SMOTE by 3.27 percent, and in Bank Marketing, it scored slightly lower than SMOTE and SVMSMOTE. Table 7 provides a comprehensive comparison of the proposed technique's performance in terms of minority class recall against SMOTE, SVMSMOTE, and BorderlineSMOTE across various data sets. Despite these mixed results, the proposed technique still showed promise in improving the classification completeness of tasks that require a high recall rate. As previously mentioned, COG aimed to enhance the recall of forecasting minority class occurrences without adversely affecting other performance metrics. The proposed method had a minimal effect on the F1-score, with average increases of 3.54, 2.72, and 3.69 percent over SMOTE, SVMSMOTE, and BorderlineSMOTE, respectively. In addition, it improved the G-mean by an average of 4.95, 4.25, and 5.53 percent, respectively, compared to these existing techniques. In some individual data sets, the G-mean decreased, corresponding to a decrease in minority class recall. Tables 8-10 provide the results of the proposed algorithm compared to SMOTE, SVMSMOTE, and BorderlineSMOTE in terms of specificity, G-mean, and F1-score on various data sets.

Influence on Additional Performance Measures
Based on the comprehensive analysis of experimental outcomes across ten real-world data sets, the results can be categorized into four distinct partitions. Initially, attention is drawn to data sets wherein the COG algorithm exhibits commendable performance, exemplified by its efficacy in the Delinquency Telecom data set. Here, a noteworthy improvement in minority class recall is evident, without concomitant alterations in other metrics. Similarly, analyses conducted on the Juvenile Delinquency, Ecoli, and Yeast data sets reveal a concurrent enhancement in both the G-mean and F1-score, indicative of a reduction in false negatives coupled with an amplification in true negatives. The confusion matrices for the Delinquency Telecom, Juvenile Delinquency, Ecoli, and Yeast data sets are presented in Figures 4-7.
Subsequently, a second group of results emerges, characterized by the COG algorithm's superior performance relative to some baseline methods. Notably, in the examination of the Lending Club data set, enhancements in the G-mean, alongside consistently robust F1-scores, signify superior performance compared to results obtained through the application of BorderlineSMOTE. Similarly, within the US Crime data set, a parallel enhancement in both the G-mean and F1-score is observable, notably surpassing outcomes achieved via SMOTE. The confusion matrices for the Lending Club and US Crime data sets are provided in Figures 8 and 9.
In the Bank Marketing data set, COG demonstrates enhanced minority class recall compared to SMOTE, with a slight improvement in G-mean suggesting a reduction in false negatives over SMOTE, albeit at the expense of a potential trade-off across various metrics. Noteworthy is COG's propensity for a more aggressive enhancement in minority class recall in comparison to SMOTE. The confusion matrices for the Credit Fraud, Happino, and Bank Marketing data sets are presented in Figures 10-12. The final category of results pertains to the distinctive attributes observed within the Optical data set. While SMOTE demonstrates commendable performance, the algorithm under scrutiny exhibits a reduction in minority class recall, deviating from the patterns observed in other data sets. Notably, the oversampling optimization process in our algorithm fails to improve the density of the minority class, which negatively affects minority class recall. Detailed insights into the performance characteristics of the Optical data set are provided through the confusion matrix depicted in Figure 13.

Reduction in Information Entropy
Elevated information entropy values indicate heightened uncertainty and ambiguity, pinpointing regions where classifiers may struggle to deliver accurate predictions. For instance, the Delinquency Telecom data set exhibits a substantial 60.04% of misclassified minority instances alongside a high information entropy value of 0.8380, elucidating the intricate and ambiguous nature of the data. Conversely, data sets such as Bank Marketing and US Crime demonstrate lower misclassification rates and information entropy, suggesting a relatively more distinct demarcation between classes. Table 11 presents an analysis of the percentage of misclassified minority instances in conjunction with the information entropy. Notably, the information entropy values are derived exclusively from misclassified minority instances across the spectrum of data sets employed in our experimental investigation. The effectiveness of the COG algorithm in reducing information entropy is substantiated through a comparative analysis involving the established methods SMOTE, SVMSMOTE, and BorderlineSMOTE. The results, delineated in Table 12, illustrate information entropy reductions across diverse data sets. Notably, COG consistently outperforms these baseline methods in mitigating information entropy. For example, in the Delinquency Telecom data set, COG achieves a substantial reduction of 0.4245 compared to the original imbalanced ratio, surpassing the reductions achieved by SMOTE (0.2457), SVMSMOTE (0.2471), and BorderlineSMOTE (0.2785). Similar trends emerge across data sets, underscoring the superior effectiveness of the proposed algorithm in addressing the ambiguity inherent in imbalanced data sets. Significantly, the observed reductions in information entropy align with concurrent improvements in minority class recall, as evidenced in Table 7.
Data sets such as Delinquency Telecom, Juvenile Delinquency, Lending Club, Ecoli, and Yeast demonstrate notable enhancements in minority class recall, establishing a consistent correlation between information entropy reduction and enhanced model performance in capturing minority class instances. This correlation underscores the meaningful contribution of COG in alleviating the challenges of imbalanced data sets, thereby facilitating improved predictive modeling in scenarios where minority class recall holds paramount importance.

Parameter Settings Found in the Proposed Method
In our exploration of the proposed methodology's performance across diverse data sets, we delved into the intricacies of hyper-parameter settings to uncover configurations that optimize algorithm performance. The hyper-parameter configurations that emerged as optimal based on experimental findings, including the number of clusters for oversampling (with the number determined using the Elbow Method shown in parentheses), oversampling details, termination imbalance ratio, and initial imbalance ratio for each data set, are presented in Table 13. The oversampling details column displays cluster numbers, where clusters with oversampling are presented in bold, while cluster numbers with no oversampling are presented in regular text. Additionally, superscript (−) indicates minority class oversampling, and superscript (+) denotes majority class oversampling in each cluster. These parameters play a pivotal role in our methodology, shaping the clustering process and influencing the subsequent customization of treatment for individual clusters. By elucidating the discovered hyper-parameter settings, we aim to provide valuable insights into the adaptations of our methodology to different data characteristics and imbalanced scenarios, thus contributing to a deeper understanding of its performance nuances.

Discussion
Significant improvements in minority class recall were detected within Delinquency Telecom, accompanied by a notable increase in the G-mean and a consistent F1-score. These findings strongly suggest a reduction in false negative predictions, aligning with the primary objective of the algorithm under study. Given the substantial class imbalance inherent in the data set, the COG algorithm strategically prioritizes augmenting the density of the minority class, thereby effectively mitigating false negatives, which are particularly prevalent in data sets characterized by elevated levels of ambiguity. Furthermore, across data sets such as Juvenile Delinquency, Ecoli, and Yeast, a simultaneous enhancement in both the G-mean and F1-score implies a concurrent reduction in false negatives alongside an amplification in true negatives, particularly discernible in data sets with diminished ambiguity levels. Lower levels of ambiguity denote less intricate problem formulations, where a clearer decision boundary may facilitate the improved detection of true positives compared to data sets characterized by higher ambiguity levels.
In the examination of the Lending Club data set, it is notable that the observed augmentation in the G-mean, accompanied by a consistent F1-score, demonstrates superior performance only when contrasted with the outcomes achieved through the application of BorderlineSMOTE. Plausible explanations for this phenomenon include the potential adequacy of SMOTE and SVMSMOTE in addressing the data set's inherent challenges, owing to their alignment with its underlying characteristics. Moreover, an analogous concurrent enhancement in both the G-mean and F1-score is discernible within the US Crime data set, notably surpassing the results attained via SMOTE. These findings underscore the efficacy of the COG algorithm in addressing imbalanced data sets, even when conventional imbalanced techniques prove inadequate in certain contexts. Consequently, the proposed algorithm emerges as a robust solution poised to effectively mitigate imbalanced data challenges.
The manifestation of the precision-recall trade-off becomes notably apparent in the Credit Fraud, Happino, and Bank Marketing data sets. In these contexts, an increase in recall for the minority class is observed, yet there is no corresponding improvement in the G-mean and F1-score. This circumstance arises from data sets exhibiting minimal ambiguity between majority and minority instances, posing challenges to oversampling techniques applied to the minority class without disturbing the inherent equilibrium of the majority class. The observed challenges may stem from the algorithm's inherent design, which prioritizes optimization within ambiguous regions. Particularly in data sets characterized by lower entropy, this approach might lead to overfitting or misclassification, as the algorithm vigorously endeavors to diminish information entropy, even in areas with well-defined class boundaries.
Acknowledging the limited efficacy of the proposed algorithm in enhancing performance within the Optical data set is imperative. This observation prompts a thorough consideration of the data set's intrinsic characteristics and its compatibility with the employed algorithm. It is plausible that the Optical data set presents unique attributes or complexities that impede the algorithm's capacity to effectively improve minority class recall and overall performance metrics. Notably, the absence of the minority class recall issue within the Optical data set distinguishes it from other data sets, where such challenges exist and are amenable to augmentation. This fundamental disparity underscores the necessity of comprehensively understanding data set nuances before evaluating the effectiveness of proposed methodologies. Furthermore, this discrepancy underscores the importance of adaptability in the algorithm's mechanisms, indicating that a more nuanced approach or adaptive thresholds may be necessary to mitigate over-optimization in data sets characterized by lower entropy. Additionally, due consideration should be given to the complexity of the model, as data sets manifesting clearer patterns may benefit from simpler models that generalize effectively.

Conclusions
The COG algorithm emerges as a promising methodology for rectifying class imbalances and ameliorating performance metrics within real-world data sets. Through the strategic augmentation of minority class density, it effectively mitigates instances of false negatives, particularly discernible in data sets characterized by pronounced class disparities. Manifesting consistent enhancements in performance metrics such as G-mean and F1-score across diverse domains, the algorithm demonstrates efficacy in bolstering both true positives and true negatives, notably within data sets exhibiting clearer delineations between classes. However, challenges arise in reconciling oversampling techniques, notably in data sets with scant ambiguity between classes. Furthermore, the difficulties the algorithm encounters on certain data sets underscore the exigency for continued refinement and adaptability to effectively navigate the variegated intricacies of data set complexity. It is imperative to acknowledge that the efficacy of the COG algorithm appears contingent upon the complexity of instances within the realm of imbalance, with the degree of success varying across data sets characterized by distinct levels of ambiguity and class distribution intricacies.
However, it is essential to acknowledge the algorithm's sensitivity to factors such as sample size, class distribution, feature complexity, and imbalance ratio. This sensitivity necessitates meticulous parameter tuning to attain optimal performance, a process known to be time-consuming and challenging. Nonetheless, the self-tuning mechanism incorporated in our algorithm is a notable feature: its autonomous optimization capability partially mitigates the complexity of parameter tuning. Despite these intricacies, the algorithm's performance merits attention. It demonstrates robust generalization across validation data sets and exhibits adaptability and efficacy when applied to real-world data sets.
Looking forward, future research could focus on enhancing the algorithm's robustness, with the aim of reducing its sensitivity to parameter variations. Further exploration of automated or semi-automated parameter-tuning mechanisms may streamline the optimization process. In addition, given the evolving landscape of machine learning, integrating advanced techniques such as neural architecture search or ensemble learning holds promise for unlocking new dimensions of performance.

Figure 3.
Figure 3. Identifying instances for information entropy calculation. To calculate the information entropy, we first determine the proportion of misclassified minority instances (P_minority) and the proportion of correctly classified majority instances (P_majority) within the data set. These proportions represent the relative frequencies of each class within the ambiguous region:

P_minority = (number of misclassified minority instances) / (total number of instances)

P_majority = (number of correctly classified majority instances) / (total number of instances)
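The proportions above translate directly into code. The sketch below computes P_minority and P_majority from label vectors and then an entropy over them; the binary Shannon entropy (base 2) is an assumption for illustration, as the caption does not reproduce the paper's exact entropy formula, and the label vectors are hypothetical:

```python
import math

def ambiguity_proportions(y_true, y_pred, minority_label=1):
    """Compute P_minority and P_majority as defined in the text."""
    n = len(y_true)
    misclassified_minority = sum(
        1 for t, p in zip(y_true, y_pred) if t == minority_label and p != t
    )
    correct_majority = sum(
        1 for t, p in zip(y_true, y_pred) if t != minority_label and p == t
    )
    return misclassified_minority / n, correct_majority / n

def region_entropy(p_minority, p_majority):
    """Shannon entropy (base 2) over the two proportions.

    Assumed form for illustration; the paper's exact definition may differ.
    """
    h = 0.0
    for p in (p_minority, p_majority):
        if p > 0:
            h -= p * math.log2(p)
    return h

# Hypothetical labels: 1 = minority, 0 = majority
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [0, 0, 1, 0, 0, 0, 1, 0]
p_min, p_maj = ambiguity_proportions(y_true, y_pred)
print(p_min, p_maj)                      # 0.25 0.5
print(region_entropy(p_min, p_maj))      # 1.0
```

Higher entropy indicates a more ambiguous region, where misclassified minority instances and correctly classified majority instances occur in comparable proportions.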

Figure 4 .
Figure 4. Comparison of confusion matrix between Original Imbalanced Ratio and proposed method using Delinquency Telecom data set.

Informatics 2024, 11, x 17 of 25

Figure 5.
Figure 5. Comparison of confusion matrix between Original Imbalanced Ratio and proposed method using Juvenile Delinquency data set.

Figure 6.
Figure 6. Comparison of confusion matrix between Original Imbalanced Ratio and proposed method using Ecoli data set.

Figure 7.
Figure 7. Comparison of confusion matrix between Original Imbalanced Ratio and proposed method using Yeast data set.

Figure 8.
Figure 8.Comparison of confusion matrix between Original Imbalanced Ratio and proposed method using Lending Club data set.

Figure 9 .
Figure 9. Comparison of confusion matrix between Original Imbalanced Ratio and proposed method using US Crime data set.

The third subset of findings comprises data sets exhibiting ambiguous performance under the COG algorithm: Credit Fraud, Happino, and Bank Marketing. The precision-recall trade-off is particularly discernible in the Credit Fraud and Happino data sets, where minority class recall increases without corresponding gains in G-mean and F1-score. Within the Bank Marketing data set, COG achieves higher minority class recall than SMOTE, and a slight improvement in G-mean suggests fewer false negatives than SMOTE, albeit with a potential trade-off across other metrics. Notably, COG enhances minority class recall more aggressively than SMOTE. The confusion matrices for the Credit Fraud, Happino, and Bank Marketing data sets are presented in Figures 10-12.
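The trade-offs among minority class recall, specificity, G-mean, and F1-score can be made concrete by deriving the metrics directly from confusion-matrix cells. A minimal sketch, with hypothetical cell counts rather than values from the paper's results:

```python
import math

def metrics_from_confusion(tp, fn, fp, tn):
    """Derive the metrics discussed in the text from binary confusion-matrix
    cells. tp/fn refer to the minority (positive) class, tn/fp to the majority.
    """
    recall = tp / (tp + fn)            # minority class recall (sensitivity)
    specificity = tn / (tn + fp)       # majority class recall
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    g_mean = math.sqrt(recall * specificity)
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"recall": recall, "specificity": specificity,
            "g_mean": g_mean, "f1": f1}

# Hypothetical cell counts for illustration only
print(metrics_from_confusion(tp=40, fn=10, fp=25, tn=125))
```

The example makes the trade-off visible: raising recall by converting false negatives into true positives typically costs some false positives, which lowers specificity and precision, so G-mean and F1-score need not improve alongside recall.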


Figure 10 .
Figure 10.Comparison of confusion matrix between Original Imbalanced Ratio and proposed method using Credit Fraud data set.


Figure 11.
Figure 11. Comparison of confusion matrix between Original Imbalanced Ratio and proposed method using Happino data set.

Figure 12.
Figure 12. Comparison of confusion matrix between Original Imbalanced Ratio and proposed method using Bank Marketing data set.

Figure 13 .
Figure 13.Comparison of confusion matrix between Original Imbalanced Ratio and proposed method using Optical data set.


Table 1 .
Characteristics of data sets used in the experiment.

Table 2 .
Hyper-parameter setting for each data set.

Table 3 .
The confusion matrix.

Table 4 .
Experimental results of Delinquency Telecom data set.

Table 5 .
Experimental results of the Juvenile Delinquency data set.

Table 6 .
Experimental results of the Lending Club data set.

Table 7 .
The proposed algorithm's minority class recall compared to SMOTE, SVMSMOTE, and BorderlineSMOTE in the experimented data sets.

Table 8 .
The proposed algorithm's specificity compared to SMOTE, SVMSMOTE, and BorderlineSMOTE in the experimented data sets.

Table 9 .
The proposed algorithm's G-mean compared to SMOTE, SVMSMOTE, and BorderlineSMOTE in the experimented data sets.

Table 10 .
The proposed algorithm's F1-score compared to SMOTE, SVMSMOTE, and BorderlineSMOTE in the experimented data sets.

Table 11 .
Percent of misclassified minority instances and information entropy of the examined data sets.

Table 12 .
Reduction in information entropy compared with the original imbalanced ratio and other baseline techniques.

Table 13 .
Hyper-parameter setting for each data set.