Enhancing classification performance in imbalanced datasets: A comparative analysis of machine learning models

Abstract: In the realm of machine learning, where data-driven insights guide decision-making, addressing the challenges posed by class imbalance in datasets has emerged as a crucial concern. The effectiveness of classification algorithms hinges not only on their intrinsic capabilities but also on their adaptability to uneven class distributions, a common issue encountered across diverse domains. This study delves into the intricate interplay between varying class imbalance levels and the performance of ten distinct classification models, unravelling the critical impact of this imbalance on the landscape of predictive analytics. Results showed that random forest (RF) and decision tree (DT) models outperformed others, exhibiting robustness to class imbalance. Logistic regression (LR), stochastic gradient descent classifier (SGDC) and naïve Bayes (NB) models struggled with imbalanced datasets. Adaptive boosting (ADA), gradient boosting (GB), extreme gradient boosting (XGB), light gradient boosting machine (LGBM), and k-nearest neighbour (kNN) models improved with balanced data. Adaptive synthetic sampling (ADASYN) yielded more reliable predictions than the under-sampling (UNDER) technique. This study provides insights for practitioners and researchers dealing with imbalanced datasets, guiding model selection and data balancing techniques. RF and DT models demonstrate superior performance, while LR, SGDC and NB models have limitations. By leveraging the strengths of RF and DT models and addressing class imbalance, classification performance in imbalanced datasets can be enhanced. This study enriches credit risk modelling literature by revealing how class imbalance impacts default probability estimation. The research deepens our understanding of class imbalance's critical role in predictive analytics. Serving as a roadmap for practitioners and researchers dealing with imbalanced data, the findings guide model selection and data balancing strategies, enhancing classification performance despite class imbalance.


Introduction
The accurate estimation of credit default risk plays a critical role in the financial industry, enabling banks and lenders to make informed decisions regarding loan approvals, credit limits, and pricing. Traditionally, credit risk assessment has relied on statistical models and expert judgment. However, with the advancements in machine learning techniques and the availability of large-scale credit data, there has been growing interest in utilizing these methods for credit risk prediction.
One significant challenge in credit risk modeling is dealing with class imbalance, where the number of default instances is significantly smaller than the number of non-default instances. This imbalance can lead to biased models and poor predictive performance (Thabtah et al., 2020), as most machine learning algorithms are designed to maximize overall accuracy and may struggle to accurately classify the minority class. Consequently, misclassification errors related to defaults can have severe financial implications. Khemakhem and Boujelbene (2018) argued that false negatives (predicting a non-default when it is actually a default) can expose lenders to potential losses and increased credit risk. On the other hand, false positives (predicting a default when it is actually a non-default) can result in unnecessary restrictions on credit access for borrowers and potential loss of business for lenders. The impact of class imbalance on credit risk prediction has gained attention in recent research. It is crucial to understand the behaviour and limitations of machine learning algorithms under imbalanced conditions (Leo et al., 2019) in order to develop robust models that effectively capture the risk associated with credit defaults. This understanding can help financial institutions enhance their decision-making processes, mitigate potential losses, and ensure fair access to credit for borrowers.
The objective of this research article is to investigate the effect of class imbalance on the estimation of default probabilities using machine learning algorithms. We aim to analyse the performance of various algorithms under different class distributions and evaluate their effectiveness in capturing default events accurately. Additionally, we will explore different techniques for addressing class imbalance, namely over-sampling (in particular ADASYN sampling) and under-sampling, to assess their impact on model performance. To achieve our research objectives, we will utilize a dataset of credit information, including borrower characteristics, historical payment behaviour and other relevant factors. We will compare the performance of different machine learning algorithms, such as logistic regression, random forest, gradient boosting and decision tree, in estimating default probabilities. We will evaluate the models using the Matthews correlation coefficient (MCC) and F1-scores, both of which are derived from the confusion matrix, together with the area under the receiver operating characteristic curve (AUC-ROC).
The findings of this study will contribute to the existing literature on credit risk modeling by providing insights into the effect of class imbalance on default probability estimation. This systematic investigation provides a deeper understanding of the critical impact of class imbalance on predictive analytics. By evaluating ten distinct classification models using rigorous evaluation metrics such as the area under the ROC curve, the Matthews correlation coefficient (MCC) and F1-scores, this research offers empirical insights into the strengths and weaknesses of these models in the context of imbalanced datasets. With its evidence-based assessment of various classification models and balancing techniques, this study serves as a valuable guide for practitioners and researchers dealing with imbalanced datasets. The findings offer clear directions for selecting appropriate models and applying tailored data balancing strategies, ultimately enhancing classification performance in the presence of class imbalance. The results will be valuable for financial institutions and regulators in developing more accurate and reliable credit risk models, enhancing credit decision-making processes, and promoting fair access to credit.
In summary, this research article aims to bridge the gap in understanding the impact of class imbalance on credit risk prediction. By investigating the behavior of machine learning algorithms under imbalanced conditions and exploring techniques to address the imbalance, we seek to improve the accuracy and reliability of default probability estimation. The outcomes of this study will contribute to the advancement of credit risk modeling and have practical implications for the financial industry.

Proposed architecture
One of the major challenges when building default prediction models is the issue of imbalanced data. Class imbalance occurs whenever the training samples of one class (the majority class) vastly outnumber those of the other (the minority class). Research has revealed that algorithms trained on an imbalanced dataset tend to suffer from prediction bias, which often results in poor performance on the minority class. This paper explores the results across ADASYN sampling and under-sampling. Figure 1 below outlines the methodology adopted in this paper.

Pre-processing
The initial and fundamental step in dealing with any sort of data is to clean the data thoroughly and make sure that it makes sense (e.g. no nuisance entries). Given the nature of our dataset (see Section 4), the first step was to standardise all the explanatory variables using min/max scaling. The motivation for this was that when an algorithm is trained with a variable such as salary, which can range from 10000 to 665000, alongside a variable such as credit utilization, which is captured as a ratio, the classifier might assume salary is more important than credit utilization. Min/max scaling (Patro and Sahu, 2015), which was done in RStudio, scales all the numerical variables to range between 0 and 1.
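As a minimal sketch of the min/max scaling described above, the following Python snippet rescales two illustrative columns onto [0, 1]. It is not the authors' RStudio code; the function name `min_max_scale` and the sample values are assumptions made for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative (made-up) values on very different scales.
salary = [10000, 337500, 665000]
utilization = [0.10, 0.55, 0.90]

scaled_salary = min_max_scale(salary)
scaled_utilization = min_max_scale(utilization)
```

After scaling, both columns span the same range, so neither dominates a distance-based or gradient-based classifier purely because of its units.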

K-fold cross-validation
Cross-validation consists of developing models that explain relationships among variables based on a subset of the data, called the training data, and then testing the model on the remaining data, called the testing data. K-fold cross-validation splits the data K times into training and testing data and then identifies the model that performs best in the aggregate (Cawley and Talbot, 2010). An even more refined approach reserves another 20% of the data for a separate testing stage, which is not part of model development and testing, but is used instead for out-of-sample testing of the model obtained on the basis of the training and testing datasets. In cross-validation, the training and testing data are kept separate and the testing data are used only when a best-fitting model has emerged (Anguita et al., 2012). In our case, we used 5-fold cross-validation on 80% of the data (in-sample) and used the remaining 20% (out-of-sample) for model testing in order to obtain more reliable results.
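The 80/20 hold-out combined with 5-fold cross-validation can be sketched as index bookkeeping. The snippet below is an illustrative Python sketch, not the authors' R code; `holdout_and_folds`, the seed and the row count of 1000 are hypothetical.

```python
import random


def holdout_and_folds(n_rows, test_fraction=0.2, k=5, seed=0):
    """Reserve an out-of-sample test set, then build k folds on the rest."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)

    n_test = int(round(n_rows * test_fraction))
    test_idx = indices[:n_test]      # never touched during model development
    in_sample = indices[n_test:]     # used for k-fold cross-validation

    # Deal the in-sample rows into k folds of (nearly) equal size.
    folds = [in_sample[i::k] for i in range(k)]

    splits = []
    for i in range(k):
        validation = folds[i]
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        splits.append((training, validation))
    return test_idx, splits


test_idx, splits = holdout_and_folds(1000)
```

Each of the five splits trains on four folds and validates on the fifth, while the reserved 20% is only used once a final model has been chosen.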

Missingness
There are two different strategies for handling missing data (Han et al., 2012). The first strategy is to simply ignore missing values and the second strategy is to consider imputation of missing values.

Omit missing values
The serious problem with omitting observations with missing values is that it reduces the dataset size. Omission is therefore appropriate only when the dataset has a small amount of missing values. There are two general approaches for ignoring missing data: listwise deletion (case deletion or complete case analysis) and pairwise deletion (available case analysis). The complete case analysis approach excludes all observations with missing values for any variable of interest. This approach limits the analysis to those observations for which all values are observed, which often results in biased estimates and loss of precision (Schafer and Graham, 2002). In pairwise deletion, we perform each analysis with all cases in which the variables of interest are present. It does not exclude the entire unit but uses as much data as possible from every unit. The advantage of this method is that it keeps the maximum available data for analysis even when some variables have missing values, so the sample size for each individual analysis is higher than under complete case analysis. The disadvantage is that it uses a different sample size for different variables (Schafer and Graham, 2002).

Impute missing values
Missing data imputation is a procedure that replaces missing values with some plausible values (Rubin, 1976). The various imputation techniques aim to provide accurate estimation of population parameters so that the power of data mining and data analysis techniques is not reduced. The optimal treatment of missing data depends on the amount of missing data. Although there is no general rule on what percentage of missing data is bad, it is always better to compare results before and after imputation.
In this paper we have adopted the median imputation method for handling missingness. Median imputation is used for numerical data and our dataset was of this composition. Median imputation handles missing values by replacing them with the median value of the non-missing observations of the same variable. This method assumes that the missing data are missing at random and that the median is a good representation of the central tendency of the data. The median is calculated by first ordering the non-missing observations of a variable and then identifying the middle value or the average of the two middle values, depending on whether the number of observations is odd or even. This imputed median value is then used to replace all the missing values of that variable.
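The median imputation step can be illustrated with a short Python sketch. It assumes missing entries are encoded as `None`; the function name `impute_median` and the income values are hypothetical and are not taken from the dataset.

```python
from statistics import median


def impute_median(column):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = median(observed)  # averages the two middle values when the count is even
    return [fill if v is None else v for v in column]


# Made-up monthly income values with two missing entries.
monthly_income = [5400, None, 3200, 9100, None, 4100]
imputed = impute_median(monthly_income)
```

Here four values are observed, so the fill value is the average of the two middle ones (4100 and 5400), and every missing entry of the variable receives that same value.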
Initially, we present a comprehensive review of the relevant literature pertaining to the research topic. Subsequently, we expound on the methodology employed in this study. The models utilized are discussed and references to additional research papers are provided for supplementary understanding. Additionally, we furnish a detailed account of the analyzed dataset, coupled with an exploratory data analysis. Finally, an extensive analysis of the results of our machine learning algorithms is presented, along with recommendations for future research.

Hyper-parameter tuning
Hyper-parameter tuning in machine learning is the process of selecting the optimal values for hyper-parameters, which are parameters set by the user that control the behaviour of the learning algorithm. The goal is to find the hyper-parameters that result in the best balance of model complexity and performance. This process can be time-consuming and computationally expensive but it is an important step in developing accurate and reliable machine learning models. Relying on default hyper-parameters forgoes the opportunity to fine-tune the models and achieve optimal performance for the specific task or dataset.
In order to develop and evaluate the performance of our machine learning models, we utilized the default hyper-parameters of the implementations in the R programming language. While this approach did not allow for the fine-tuning of hyper-parameters to achieve optimal performance for our specific task and dataset, it allowed us to establish a baseline level of performance and compare the relative performance of different models. This information was valuable in guiding our model selection process and identifying areas for future improvement.

ADASYN sampling and under-sampling
ADASYN and under-sampling are techniques used in machine learning to address class imbalance in datasets. Under-sampling involves reducing the number of instances in the majority class to create a more balanced dataset, allowing the classifier to learn effectively from both classes. This can be achieved through methods such as random under-sampling or removing instances close to the decision boundary. On the other hand, ADASYN sampling takes a more adaptive approach by generating synthetic examples for the minority class, particularly focusing on difficult-to-learn instances. By augmenting the minority class, ADASYN aims to improve the classifier's performance and achieve better predictive accuracy. While under-sampling can lead to loss of information from the majority class, ADASYN sampling leverages the distribution of the minority class to generate synthetic samples and overcome the imbalance issue. Both techniques aim to enhance the learning process in imbalanced datasets, but they adopt different strategies to achieve a balanced representation of the classes.
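To make the adaptive idea concrete, the following is a simplified Python sketch of ADASYN-style sampling, not the implementation used in this study. It assumes label 1 marks the minority class, Euclidean distance, and at least two minority rows; the function name `adasyn`, the parameters `k` and `beta`, and the toy data are illustrative assumptions.

```python
import math
import random


def adasyn(X, y, k=5, beta=1.0, seed=0):
    """Return synthetic minority rows (label 1 is assumed to be the minority)."""
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == 1]
    n_majority = len(y) - len(minority)
    # Number of synthetic rows to create; beta=1.0 aims at a balanced set.
    total = int((n_majority - len(minority)) * beta)

    def nearest(point, pool, k):
        # Indices of the k rows in pool closest to point, excluding point itself.
        order = sorted(range(len(pool)), key=lambda j: math.dist(point, pool[j]))
        return [j for j in order if pool[j] is not point][:k]

    # Difficulty of each minority row: share of majority rows among its k neighbours.
    ratios = []
    for x in minority:
        neighbours = nearest(x, X, k)
        ratios.append(sum(1 for j in neighbours if y[j] == 0) / k)
    norm = sum(ratios)
    if norm == 0:
        return []  # no minority row lies near the majority class

    synthetic = []
    for x, r in zip(minority, ratios):
        n_new = int(round(r / norm * total))   # harder rows receive more synthetic rows
        candidates = nearest(x, minority, k)   # minority neighbours to interpolate towards
        for _ in range(n_new):
            z = minority[rng.choice(candidates)]
            lam = rng.random()
            synthetic.append([a + lam * (b - a) for a, b in zip(x, z)])
    return synthetic


# Toy data: 12 majority rows on a line and 4 minority rows in two small groups.
X = [[float(i), 0.0] for i in range(12)] + [[2.5, 0.5], [3.5, 0.5], [9.5, 5.0], [10.0, 5.5]]
y = [0] * 12 + [1] * 4
new_rows = adasyn(X, y, k=3)
```

Each synthetic row lies on the segment between a minority row and one of its minority neighbours, and minority rows surrounded by more majority neighbours contribute more synthetic rows.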

Machine learning algorithms
In data analytics, machine learning is a set of computational methods which use experience to improve performance or to make accurate predictions (Breeden, 2021). Here, the word "experience" refers to the past information available to the machine learning technique or classifier. In particular, data quality and data size are at the core of machine learning and, since the success of a learning algorithm depends on the data used, machine learning is closely related to data analysis and statistics.
Learning is a wide domain; consequently, it can be branched into subfields dealing with different types of learning. The most common partition is the one that distinguishes between supervised and unsupervised learning according to the types of training data available to the classifier (Breeden, 2021). Figure 2 depicts the word-cloud jargon of machine learning in credit risk modelling. In supervised learning, an algorithm is trained using labelled data to make predictions; this is the most common scenario when dealing with classification or regression problems. In unsupervised learning, an algorithm is fed unlabelled data and is tasked with learning the structure of the data on its own, without reference to known outcomes; this approach is popular in clustering and association problems. Another type of machine learning is reinforcement learning, where an intelligent agent takes actions in an environment in order to maximize a notion of cumulative reward. This is used largely for sequential decision-making and control problems.
Decision tree (DT): The decision tree algorithm is a popular method for default prediction due to its simplicity, interpretability and its ability to handle large datasets with high dimensionality. It uses a tree-like model of decisions and their possible consequences, recursively partitioning the feature space into smaller regions in which the outcomes are as homogeneous as possible. However, decision trees are known to be sensitive to class imbalance since they tend to be biased towards the majority class. Breiman et al. (1984) and Fayyad and Irani (1992) give a full description of the model.
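A single step of the recursive partitioning can be sketched as a search for the threshold that leaves the two resulting regions as homogeneous as possible. The Python sketch below assumes binary 0/1 labels and uses Gini impurity as the homogeneity measure, which is one common choice and an assumption here; `gini` and `best_threshold` are hypothetical names.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    best = (None, float("inf"))
    # The largest value is skipped so that the right-hand region is never empty.
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Illustrative feature values with a clean separation between the classes.
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]
split = best_threshold(values, labels)
```

A full tree repeats this search over every feature and then recurses into each of the two regions until a stopping rule is met.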
k-Nearest neighbor (kNN): The k-nearest neighbour classifier (kNN) is known to be one of the most useful instance-based learners. kNN (Yao and Ruzzo, 2006) is a non-parametric method that makes predictions based on the majority class of the k nearest points to a given test point. It is simple and efficient but can be sensitive to the choice of k and the distance metric used. kNN has been used in the literature for default prediction, mainly in the credit risk domain. A comprehensive description of kNN is provided by Kelleher et al. (2020), Stephens and Diesing (2014) and Wilson and Martinez (1997).
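The majority-vote rule can be written in a few lines. The sketch below assumes Euclidean distance and a small toy dataset; `knn_predict` is a hypothetical name and not the R routine used in the study.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the majority label among the k training points nearest to query."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(train_X, train_y, [0.2, 0.2])
near_far_cluster = knn_predict(train_X, train_y, [5.5, 5.5])
```

Because the vote is taken over raw distances, the result depends on both k and the scaling of the features, which is one reason the min/max scaling step matters for this model.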
Logistic regression model (LR): One of the most commonly used statistical models is the logistic regression model, which explains the relationship of several covariates x to a binary response variable. The primary objective of the logistic regression model with multiple predictors (Zhang et al., 2017) is to describe the relationship between a binary response variable and one or more predictor variables. Logistic regression is widely used in various fields, including medicine, finance, and social sciences, where binary classification tasks are common. It provides interpretable results because the coefficients can be examined to understand the impact of the features on the probability of the positive class. However, logistic regression assumes a linear relationship between the features and the log-odds, and may not perform well when dealing with complex nonlinear patterns in the data.
Naïve Bayesian approach (NB): The naïve Bayes classifier is a probabilistic classification algorithm based on Bayes' Theorem that has been widely used for default prediction problems. It assumes that the predictors are independent given the class label, which is called the "naive" assumption. Despite this assumption, it has been shown to be effective in several studies, and it can handle class imbalance by adjusting class weights or using techniques like oversampling, undersampling and synthetic data generation. A full description of the algorithm can be found in De Campos et al. (2011) and Stephens and Diesing (2014).
Light gradient boosting machine (LGBM): LGBM is a powerful machine learning model that has gained popularity in both regression and classification tasks. It is a gradient boosting framework that utilizes tree-based learning algorithms (Ke et al., 2017). LGBM is designed to handle large-scale datasets efficiently, making it suitable for real-world applications with high-dimensional features. It employs a technique called gradient-based one-side sampling (GOSS), which keeps the instances with large gradients (those the current model fits poorly) and randomly down-samples the instances with small gradients during the boosting process. Because GOSS concentrates training on poorly fitted instances, which in imbalanced data often belong to the minority class, LGBM is frequently applied to classification tasks where class imbalances are common. Ke et al. (2017) and Li et al. (2022) provide more context to the LGBM model.
Random forest (RF): As proposed by Breiman (2001), random forest is a combination of decision trees (Ho, 1995) used as an ensemble learning method for classification, regression and other tasks. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes predicted by the individual trees (classification) or the mean of their predictions (regression). Deng et al. (2018) defined random forest as a type of learning method that can be used for both classification and regression problems. Furthermore, random forests are suitable for large quantities of data with substantial noise, can reduce over-fitting, and are able to distinguish important features in classification. Breiman (2001), Calderoni et al. (2015) and Booth et al. (2015) give a more detailed discussion of random forests, including a more rigorous mathematical description.
Adaptive boosting (ADA): The adaptive boosting (ADA) classifier is a popular machine learning algorithm used for classification tasks. It is a type of ensemble learning method that combines multiple weak classifiers to create a strong and accurate classifier. ADA iteratively trains a series of weak classifiers, where each subsequent classifier is designed to focus on the instances that were misclassified by the previous classifiers. This iterative process helps to improve the overall performance of the classifier. Granström and Abrahamsson (2019) provide the full details regarding the implementation of this algorithm.
Stochastic gradient descent classifier (SGDC): The stochastic gradient descent classifier (SGDC) is a popular and efficient algorithm used in machine learning for solving classification problems. It is a variant of the gradient descent optimization algorithm (Lokeswari and Amaravathi, 2018), specifically designed for large-scale datasets. The SGDC iteratively updates the model parameters by taking small steps in the direction of steepest descent (the negative gradient), aiming to minimize the loss function. Unlike traditional gradient descent (Liu et al., 2021), which calculates the gradient over the entire training dataset, SGDC performs updates on single observations or on randomly selected subsets of data called mini-batches. This stochastic nature of SGDC allows for faster convergence and makes it highly suitable for working with massive datasets.
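The mini-batch update can be illustrated on a logistic loss. The Python sketch below is an assumption-laden illustration rather than the classifier used in the study: the loss choice, the learning rate, the batch size, the toy data and the names `sgd_logistic` and `predict_proba` are all hypothetical.

```python
import math
import random


def sgd_logistic(X, y, lr=0.5, epochs=300, batch_size=2, seed=0):
    """Fit weights w and intercept b of a logistic model with mini-batch SGD."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)  # new random mini-batches every epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            grad_w = [0.0] * len(w)
            grad_b = 0.0
            for i in batch:
                z = b + sum(wj * xj for wj, xj in zip(w, X[i]))
                p = 1.0 / (1.0 + math.exp(-z))
                err = p - y[i]  # gradient of the log-loss with respect to z
                for j, xj in enumerate(X[i]):
                    grad_w[j] += err * xj
                grad_b += err
            # Step against the averaged mini-batch gradient.
            w = [wj - lr * g / len(batch) for wj, g in zip(w, grad_w)]
            b -= lr * grad_b / len(batch)
    return w, b


def predict_proba(w, b, x):
    """Predicted probability of the positive class for one row x."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


# One scaled feature; low values are non-defaults (0), high values defaults (1).
X = [[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = sgd_logistic(X, y)
```

Each update touches only two rows here, which is what keeps the cost per step independent of the dataset size.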
Gradient boosting (GB): Gradient boosting is a powerful machine learning algorithm that combines multiple weak learners, typically decision trees, to create a strong predictive model. It operates by sequentially adding new models to correct the errors made by the previous models, thereby gradually improving its predictive accuracy. The algorithm works by optimizing a specific loss function through an iterative process. Each subsequent model is trained to minimize the errors or residuals of the previous models, using gradient descent optimization. Gradient boosting is known for its ability to handle complex nonlinear relationships in data and is widely used in various domains, including regression, classification, and ranking problems. It has gained popularity due to its high predictive performance and robustness. Dorogush et al. (2018) and Bentéjac et al. (2021) expand more on the model.
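The residual-fitting loop can be sketched with one-split regression trees (stumps) and squared-error loss, for which the negative gradient is simply the residual. This is a minimal illustration under those assumptions, not the implementation used in the study; `fit_stump`, `gradient_boost` and the toy data are hypothetical.

```python
def fit_stump(x, residuals):
    """Best single split 'x <= t' for predicting residuals by two constants."""
    best = None
    for t in sorted(set(x))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for v, r in zip(x, residuals) if v <= t]
        right = [r for v, r in zip(x, residuals) if v > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Boost stumps on squared-error loss; returns the base value, stumps and fit."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared error the negative gradient equals the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, lm, rm = fit_stump(x, residuals)
        stumps.append((t, lm, rm))
        pred = [p + learning_rate * (lm if v <= t else rm) for p, v in zip(pred, x)]
    return base, stumps, pred


x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
base, stumps, fitted = gradient_boost(x, y)
```

With a learning rate of 0.5, each round removes half of the remaining residual on this toy data, so the fitted values approach the labels geometrically.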
Extreme gradient boosting (XGB): Another ensemble technique constructed by continuous iterations of weak classifiers is extreme gradient boosting. According to Ogunleye and Wang (2019), the model was proposed by Chen and Guestrin (2016) to optimize memory usage and exploit hardware computing power. XGB decreases the execution time while increasing performance compared to many machine learning algorithms and even deep learning models. The main idea of boosting is to build trees sequentially such that each subsequent tree reduces the errors of the previous ones. In this procedure, k regression trees are created to ensure that the prediction of the tree ensemble is as close to the actual value as possible and that the generalization capability is as high as possible. More details about the procedure are given by Dhieb et al. (2019), Ogunleye and Wang (2019) and Chen and Guestrin (2016).

Data description
In this section we provide some information on the dataset utilised, present an exploratory data analysis and motivate our choice of models. A Kaggle dataset was used in this paper, which contained 11 features and 150000 observations (Kaggle, 2023). Kaggle is a well-known platform for data science competitions, collaboration and learning. It hosts a wide variety of datasets contributed by the community, covering diverse topics and domains. These datasets are often used for data analysis, machine learning projects and research. Kaggle datasets range from structured data in CSV files to images, videos and more complex data types.

Data Science in Finance and Economics
Volume 3, Issue 4, 354-379.

Table 1 gives the dictionary of the dataset adopted in this paper. Roughly 2% of the data was missing, particularly within the monthly income variable as well as the number of dependents. This is shown visually in Figure 3. We thought it worthwhile to check for relationships among our explanatory variables before attempting to fit any models. This task is termed checking for the existence of multicollinearity within the data. Figure 4 displays the results that were obtained after the test was conducted. The results show a very strong correlation among the Bucket 1 through Bucket 3 variables. Moreover, a high correlation was also seen between Bucket 1 and Number Real Estate Loans Or Lines.
The dataset originally had 7% (10026) positive cases and 93% (139974) negative cases. Since the objective of this paper was to investigate the effectiveness of various machine learning models under class imbalance, we generated nine samples with different levels of class imbalance for each of the sampling techniques discussed in Section 2.3. As a result, we ended up with eighteen (18) samples, as shown in Table 2. In under-sampling, the minority class was kept the same while the majority class was under-sampled to meet the desired class imbalance. On the other hand, in ADASYN sampling the majority class was fixed at its original number of observations while the minority class was over-sampled to meet the desired class imbalance.
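The class sizes implied by a target share of positive cases follow from simple arithmetic on the original counts. The sketch below works this out in Python for two illustrative targets (10% and 50% positives); the function names are hypothetical, and the exact sample sizes the authors used are those in Table 2, not these computed values.

```python
N_POS = 10026     # original positive (default) cases
N_NEG = 139974    # original negative (non-default) cases


def undersample_majority_size(n_pos, target_pos_pct):
    """Majority rows to keep so that positives make up target_pos_pct percent."""
    return round(n_pos * (100 - target_pos_pct) / target_pos_pct)


def adasyn_synthetic_count(n_neg, n_pos, target_pos_pct):
    """Synthetic minority rows to add while keeping every majority row."""
    needed_pos = round(n_neg * target_pos_pct / (100 - target_pos_pct))
    return max(needed_pos - n_pos, 0)


# Under-sampling: the minority class is fixed, the majority class shrinks.
keep_neg_10 = undersample_majority_size(N_POS, 10)
keep_neg_50 = undersample_majority_size(N_POS, 50)

# ADASYN: the majority class is fixed, the minority class grows.
add_pos_10 = adasyn_synthetic_count(N_NEG, N_POS, 10)
add_pos_50 = adasyn_synthetic_count(N_NEG, N_POS, 50)
```

The two techniques therefore produce samples of very different total size for the same target share, which is worth bearing in mind when comparing their results.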

Measures of performance
We adopt the widely used measures of performance in the field of credit risk to evaluate our classification algorithms. These include the area under the receiver operating characteristic (ROC) curve. The ROC curve shows how capable a model is of distinguishing between classes; an excellent model will have an area under the curve (AUC) close to 1, while a poor model will have an AUC close to 0.5. The ROC curve is constructed by evaluating the fraction of "true positives" (TP) and "false positives" (FP) for different threshold values. Table 3 shows the so-called confusion matrix, which contains the basic quantities from which the reported metrics are computed. We report on the following metrics. Note: these formulas are derived using the 2x2 confusion matrix (Table 3) (Mitchell and Mitchell, 1997); for multi-class or multi-label classification the formulas will be different. ROC-AUC is a measure of the trade-off between the true positive rate (sensitivity) and the false positive rate (1-specificity) in a binary classification problem. An AUC of 1 represents a perfect classifier and an AUC of 0.5 represents a random classifier. Recall gain is used when the data is imbalanced and the goal is to improve recall, and it is defined as: recall gain = (recall of model - recall of baseline)
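For reference, the confusion-matrix metrics named in this paper can be computed as in the Python sketch below. The formulas are the standard textbook definitions and are assumed here rather than quoted from the paper; the relation Gini = 2 x AUC - 1 is the usual credit-risk convention and is likewise an assumption. The function names and the counts are illustrative.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Precision, recall, F1 and MCC from the four cells of a 2x2 confusion matrix."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


def gini_from_auc(auc):
    """Gini coefficient under the assumed convention Gini = 2 * AUC - 1."""
    return 2 * auc - 1


# Made-up counts: 100 actual defaults and 900 actual non-defaults.
metrics = confusion_metrics(tp=80, fp=20, tn=880, fn=20)
```

Unlike accuracy, MCC uses all four cells of the matrix, so a model that predicts only the majority class scores 0 rather than a deceptively high value.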

Results
This section presents an overview of the evaluation measures obtained after training the classification models in R. Prior to fitting the models, our dataset underwent a simulation process to create nine samples with varying levels of class imbalance using both under-sampling and ADASYN sampling techniques, as depicted in Figure 5. Each sample was characterized by a specific percentage of positive cases in the target variable; the first sample, for example, contained 10% positive cases and 90% negative cases. Subsequent samples followed a similar pattern, with increasing percentages of positive cases.
The evaluation focused on five key metrics: the area under the ROC curve, the Matthews correlation coefficient (MCC), the Gini coefficient, recall scores and F1-scores. The results for the Gini coefficient and the recall scores can be found in Appendix 7. These metrics were analysed for the two sampling techniques investigated in the study. The area under the ROC curve provided insights into the models' performance in distinguishing between the positive and negative classes. MCC served as a measure of the models' overall performance, taking into account true positives, true negatives, false positives, and false negatives. Last, the F1-scores provided a balanced assessment of precision and recall for the models. These evaluation measures were instrumental in assessing the performance of the classification models and drawing meaningful conclusions about their effectiveness under different levels of class imbalance.
After generating and saving our simulated samples with varying degrees of class imbalance, we proceeded to the modeling phase. Each simulated dataset was divided into two parts: 80% for training the model and 20% for model testing. We employed 10 machine learning models (as discussed in Section 3) on each of the simulated samples and reported the Matthews correlation coefficient to compare the training sets. We further used F1-scores and the area under the ROC curve to validate the performance of the models on out-of-sample datasets. Table 4 below summarises the results obtained after training and testing the models according to the Matthews correlation coefficient.

First, it is imperative to highlight the exceptional performance of the random forest (RF) and decision tree (DT) algorithms in both the training and testing phases. Even when trained on highly imbalanced datasets (Sample 1), both algorithms achieved MCC scores exceeding 99% and demonstrated slight improvements as the data approached balance. Remarkably, these models appeared to be relatively insensitive to class imbalance during the training stage. Visually, the dominance of the RF and DT algorithms over the other models is evident in the line plots presented in Figure 6 and Figure 8.

The second set of models that exhibited noteworthy performance included ADA, GB, XGB, LGBM and kNN. These models showed significant improvements as the data became more balanced. With a dataset containing 10% positive cases, these models achieved MCC scores ranging from 35.5% to 55.8%, which rose to between 54.6% and 61.1% for a fully balanced dataset. This trend was consistent across both sampling techniques. The line plots in Figure 6 and Figure 8 offer insights into the sensitivity of these models to class imbalance. In contrast, the LR, SGDC, and NB algorithms exhibited the poorest performance across both sampling techniques, scoring MCC values below 50% regardless of the class imbalance in the data. Further analysis on this matter will be explored in the upcoming section.
The subsequent step of the analysis involved evaluating the performance of these models on separate out-of-sample datasets, which were reserved for model testing. Once again, the standout performers were the RF and DT models. When employing under-sampling, these models achieved MCC scores ranging from as low as 63.4% and 73.3% to as high as 65.7% and 75.5%, respectively. In the case of ADASYN sampling, their scores ranged from as low as 70.8% and 80.5% to as high as 94.2% and 97.3%, respectively. Although the scores decreased compared to the training sets, the RF and DT models continued to outperform the other models. This trend is also visually represented in the line plots depicted in Figure 7 and Figure 9.
Similarly, the ADA, GB, XGB, LGBM and kNN models exhibited comparable patterns in the out-of-sample datasets to those observed in the training samples, albeit with slightly lower scores. Notably, the kNN model demonstrated the most significant improvements as the data became more balanced. Even with fully balanced datasets, the LR, LGBM, kNN and NB models still scored MCC values below 50% when utilizing under-sampling. However, the kNN model did exhibit some enhancement in prediction quality when employing the ADASYN sampling technique, achieving an MCC score of 86.6%. The LR, LGBM and NB models, on the other hand, remained below 50% in terms of MCC scores.
The F1-scores presented in Table 5 further reinforce the observation that the RF and DT models were the top performers during the testing stage. When employing the under-sampling technique, these models achieved scores as low as 66.7% and 74.6%, respectively, which improved to 75.5% and 83.1% as the data became more balanced. In the case of ADASYN sampling, their scores improved from 73.5% and 81.6% to 96.1% and 98.2%, respectively. Across both sampling techniques, the ADA, GB, XGB and LGBM models demonstrated gradual improvements in predictive power as the data distribution approached equilibrium. The SGDC and kNN models exhibited the most significant improvements as the dataset became more balanced. At a 10% class imbalance, SGDC and kNN models achieved F1-scores

Data Science in Finance and Economics, Volume 3, Issue 4, 354-379.

Finally, we present visualizations of the ROC curves, together with the area under the curve (AUC), for the testing stage across various sample sizes. The ROC curves, as shown in Figure 12, provide further confirmation of the previous findings. While the random forest (RF) and decision tree (DT) models consistently exhibited superior performance compared to other classifiers across different levels of class imbalance, readers should also note the performance improvements of all models as the data becomes more balanced. A noteworthy observation is that the ROC curves for the under-sampling technique, as illustrated in Figure 13, appeared relatively flatter and closer to the diagonal line than those for the ADASYN sampling technique, as depicted in Figure 12. This observation implies that ADASYN sampling tends to produce more reliable predictions than the under-sampling technique.
The visualizations of the ROC curves provide additional evidence of the strength of RF and DT models throughout various class imbalance scenarios. Furthermore, the results highlight the importance of considering the performance improvements of all models as data balance improves. Additionally, the ROC curves suggest that ADASYN sampling may offer enhanced prediction reliability compared to under-sampling.
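The phrase "closer to the diagonal line" can be made concrete with a small Python sketch. It computes the AUC using its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The labels and scores below are hypothetical, and this is an illustrative sketch rather than the procedure used to produce Figures 12 and 13.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative).

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 0, 1, 1]
good_scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]  # positives ranked above all negatives
flat_scores = [0.5] * 6                       # scores carry no information

print(roc_auc(labels, good_scores))
print(roc_auc(labels, flat_scores))
```

A ROC curve that hugs the diagonal corresponds to an AUC near 0.5, as with the uninformative scores here, while a curve that bows towards the top-left corner corresponds to an AUC near 1.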

In summary, the results highlight the superior performance of RF and DT models in both training and testing stages, emphasizing their robustness to class imbalance. Other models, such as ADA, GB, XGB, LGBM and kNN, showed improvements as the data became more balanced but did not surpass the performance of RF and DT models. LR, SGDC and NB models consistently performed poorly regardless of the sampling technique used. The findings also suggest that the ADASYN sampling technique yielded more reliable predictions than the under-sampling technique.

Discussion
The results obtained in this study align with some findings reported in related work on the topic of class imbalance and classification models.
The superior performance of the random forest (RF) and decision tree (DT) algorithms, especially in the presence of class imbalance, is consistent with previous research. Alija et al. (2023) and Zhou and Wang (2012) conducted similar investigations under class imbalance and found that random forest tends to outperform many state-of-the-art classifiers such as SVM, ANN, naïve Bayes and C4.5. RF and DT models are known for their ability to handle imbalanced datasets effectively by capturing complex decision boundaries and handling both minority and majority classes well. Sun et al. (2018) also found that the decision tree significantly outperforms other models and is effective for imbalanced enterprise credit evaluation. The high MCC scores achieved by RF and DT models in this study support their suitability for imbalanced classification tasks, as reported in previous studies.

RF and DT algorithms perform well on imbalanced data due to their inherent robustness to class imbalance (Singhal et al., 2018), the use of sampling and randomness, their ability to handle overlapping regions and the benefits of ensemble methods. DTs are less affected by class imbalance as they can capture patterns in both minority and majority classes (Liu et al., 2010) during the splitting process. RF, consisting of multiple decision trees trained on bootstrap samples, introduces randomness and diversity, enabling the algorithm to learn from both classes. DTs can form partitions that help separate the minority class instances, improving classification performance. Finally, the ensemble nature of RF leverages the collective wisdom of multiple trees, further enhancing its ability to handle imbalanced data (Liu et al., 2010).
The results also align with previous studies that have highlighted the challenges faced by logistic regression, stochastic gradient descent classifier, and naïve Bayes algorithms in imbalanced classification tasks. These models often struggle to handle class imbalance, resulting in lower MCC scores and poorer performance compared to other algorithms. The difficulties stem from the skewed class distribution (Aljedaani et al., 2022), loss function optimization, the assumption of feature independence, and sensitivity to data representation. According to Das et al. (2018), the skewed class distribution can lead to biased models and difficulties in capturing patterns for the minority class. The loss functions used by LR and SGDC may prioritize the majority class, resulting in biased decision boundaries and poor performance on the minority class. NB's assumption of feature independence can disregard rare but discriminative features for the minority class. Additionally, these algorithms may struggle to find sufficient evidence to accurately model the minority class (Das et al., 2018) due to its underrepresentation. To overcome these challenges, techniques like resampling, adjusting class weights, using different loss functions, or employing specialized algorithms designed for imbalanced data can be applied. The consistently poor performance of the LR, SGDC and NB models in this study reinforces the need to carefully select appropriate classifiers when dealing with imbalanced datasets.
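As a hedged illustration of the class-weight remedy mentioned above, the sketch below compares an unweighted and a class-weighted log loss on a hypothetical sample with 5% positives. The weighting rule, n_samples / (n_classes * class_count), is one common "balanced" heuristic. It is an assumption made for illustration and was not part of the experiments reported in this paper.

```python
import math
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

def log_loss(y, p, weights=None):
    """Mean binary cross-entropy, optionally class-weighted; p is P(y = 1)."""
    w = [1.0 if weights is None else weights[yi] for yi in y]
    losses = [-(yi * math.log(pi) + (1 - yi) * math.log(1 - pi))
              for yi, pi in zip(y, p)]
    return sum(wi * li for wi, li in zip(w, losses)) / sum(w)

# Hypothetical sample: 95 negatives and 5 positives.
y = [0] * 95 + [1] * 5
# A model that assigns everyone a low default probability.
p_majority = [0.05] * 100

weights = balanced_class_weights(y)
print(weights)
print(log_loss(y, p_majority))           # dominated by the 95 negatives
print(log_loss(y, p_majority, weights))  # each class contributes half the weight
```

Without weights, the model that ignores the minority class looks acceptable because the 95 negatives dominate the average. With balanced weights, the five misjudged positives carry as much total weight as the negatives, so the same predictions are penalized far more heavily.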
ADA, gradient boosting, extreme gradient boosting, light gradient boosting machine, and k-nearest neighbors models can be sensitive to imbalanced data due to their underlying mechanisms and characteristics:
1. Data weighting and boosting: Models like ADA, GB, XGB and LGBM utilize boosting techniques, where multiple weak classifiers are combined to form a strong classifier. In the presence of imbalanced data, these models tend to assign higher weights to misclassified instances from the minority class during the training process. This weighting scheme can result in an overemphasis on the minority class, potentially leading to misclassifications and biased decision boundaries (Okey et al., 2022).
2. Loss function optimization: Boosting algorithms aim to minimize a loss function by iteratively fitting models to the training data. In imbalanced datasets, the loss function used (Fernando and Tsokos, 2021) may not adequately capture the cost of misclassifying the minority class. As a result, the models might prioritize minimizing the overall loss (Laradji et al., 2015), which is dominated by the majority class, leading to a bias towards the majority class and reduced performance on the minority class.
3. Nearest neighbor-based approach: The kNN algorithm makes predictions based on the class labels of an instance's nearest neighbors. In the presence of imbalanced data, the sparsity of the minority class (Padmaja et al., 2007) can lead to situations where the nearest neighbors of a minority instance predominantly belong to the majority class. This can result in misclassifications and a tendency to favor the majority class during classification.
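The nearest-neighbor effect described in the third point can be reproduced with a toy example. The one-dimensional data and the function name below are hypothetical; the sketch only shows how a majority vote can swamp a sparse minority class as k grows.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query` (1-D feature)."""
    nearest = sorted(train, key=lambda xy: abs(xy[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical data: 20 majority points (label 0) spread over [0, 1.9]
# and only two minority points (label 1) clustered near x = 1.03.
train = [(x / 10, 0) for x in range(20)] + [(1.02, 1), (1.04, 1)]

query = 1.03  # lies right between the two minority points
for k in (1, 3, 5):
    print(k, knn_predict(train, query, k))
```

With k = 1 or k = 3 the query is assigned to the minority class, but with k = 5 the vote includes three majority neighbors and the prediction flips to the majority class, even though the query sits in the middle of the minority cluster.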
The gradual improvements observed in the performance of these models as the data became more balanced are consistent with the notion that as the class distribution becomes more even, classifiers tend to achieve better results. This observation supports the idea that balancing techniques, such as ADASYN sampling, can alleviate the negative impact of class imbalance on the performance of classifiers.
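For readers who want to experiment with balancing, a minimal sketch of random under-sampling is given below. The function name and the `minority_share` parameter are hypothetical and need not match how the samples in this study were constructed. ADASYN is not implemented here: instead of discarding majority rows, it generates synthetic minority points, with more of them in regions where minority cases are surrounded by majority neighbors.

```python
import random

def random_under_sample(X, y, minority_share, seed=0):
    """Drop majority rows at random until label 1 makes up `minority_share` of the data."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    majority = [i for i, label in enumerate(y) if label == 0]
    # Majority rows to keep so that minority / total equals minority_share.
    keep = min(len(majority),
               round(len(minority) * (1 - minority_share) / minority_share))
    chosen = sorted(minority + rng.sample(majority, keep))
    return [X[i] for i in chosen], [y[i] for i in chosen]

# Hypothetical data: 50 positives among 1000 rows (5% minority class).
y = [1] * 50 + [0] * 950
X = [[float(i)] for i in range(1000)]

X_bal, y_bal = random_under_sample(X, y, minority_share=0.5)
print(len(y_bal), sum(y_bal))
```

In this example, full balance is reached only by discarding 900 of the 950 majority rows. That loss of information is one plausible reason, offered here as an interpretation rather than a tested result, why under-sampling produced less reliable predictions than ADASYN in this study.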
Overall, the findings of this study align with existing research on the performance of classification models in the presence of class imbalance. The superiority of RF and DT models, the challenges faced by LR, SGDC and NB models, and the performance improvements with data balancing techniques are consistent with previous findings. These results contribute to the growing body of knowledge on class imbalance and provide further evidence of the effectiveness of certain algorithms in imbalanced classification tasks.

Conclusions
In conclusion, this study investigated the performance of various classification models in the presence of class imbalance. The results shed light on the impact of class distribution on the effectiveness of different algorithms and the importance of data balancing techniques. The findings highlight the outstanding performance of the random forest and decision tree algorithms, which consistently outperformed other models in both training and testing stages. These models demonstrated robustness to class imbalance and achieved high Matthews correlation coefficient scores even when trained on highly imbalanced datasets. The visual representations and the area under the ROC curves further supported their superiority over other classifiers.
On the other hand, logistic regression, stochastic gradient descent classifier, and naïve Bayes models exhibited poor performance regardless of the class imbalance in the data. These models struggled to handle imbalanced datasets and scored lower MCC values compared to other algorithms. The study also highlighted the performance improvements of models such as ADA, GB, XGB, LGBM, and kNN as the data became more balanced. These models showed increased predictive power and achieved higher MCC scores as the class distribution became more even. The results further emphasized the effectiveness of the ADASYN sampling technique in producing more reliable predictions than under-sampling. The findings of this study align with prior research on imbalanced classification tasks, providing further evidence of the superiority of RF and DT models and the challenges faced by LR, SGDC, and NB models. The results contribute to the existing body of knowledge on class imbalance and highlight the importance of selecting appropriate algorithms and employing data balancing techniques for improved classification performance.
Overall, this study emphasizes the need for careful consideration of the choice of classification models and the implementation of data balancing techniques when dealing with imbalanced datasets. The results can inform practitioners and researchers in selecting the most suitable models for imbalanced classification tasks and guide the development of more effective approaches to address class imbalance challenges. This investigation provides a deeper understanding of the pivotal influence of class imbalance on predictive analytics. By assessing ten diverse classification models using robust evaluation metrics, such as the area under the ROC curve, the Matthews correlation coefficient (MCC) and F1-scores, this study furnishes empirical insights into these models' strengths and weaknesses on imbalanced datasets. Guided by data-driven model assessments and balancing approaches, this research serves as a valuable roadmap for practitioners and researchers grappling with imbalanced datasets. The results provide explicit guidelines for model selection and tailored data balancing techniques, ultimately enhancing classification performance in the face of class imbalance.
However, this study has certain limitations that should be acknowledged. First, the analysis was conducted using a specific dataset with its own characteristics, from which we simulated eighteen (18) different samples, and the results may not generalize to other datasets or domains. Therefore, it is crucial to validate these findings on different datasets to ensure their applicability in diverse contexts. Second, the study focused solely on the performance of classification models and did not delve into the underlying reasons for the observed differences in performance. Future research could explore the specific factors contributing to the effectiveness or ineffectiveness of different models in handling class imbalance. This could involve examining feature importance, model interpretability, or identifying specific patterns in the data that affect model performance.

Funding
This work is based on the research supported wholly/in part by the National Research Foundation of South Africa (Grant Number 126885). This work is based on research supported in part by the Department of Science and Innovation (DSI) of South Africa. The grant holder acknowledges that opinions, findings, and conclusions or recommendations expressed in any publication generated by DSI-supported research are those of the authors and that the DSI accepts no liability whatsoever in this regard.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article, except for the machine learning models used for the analysis.

This section provides more results that were obtained from training and testing of the machine learning models as summarised by the Gini score (Table 6) and the recall measure (Table 7). The results also conformed with the findings obtained using the MCC, F1-score and AUROC measures in Section 5. Once again, the decision tree and random forest models outperformed the rest of the models across samples of varying class imbalance, regardless of the sampling technique used.

Figure 2. Machine learning in statistics.
Recall score: Also known as the true positive rate, it measures the proportion of actual positive instances that the model correctly identifies. It is a measure of the model's sensitivity in detecting positive cases.
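The recall definition, together with the F1-score used in the main analysis, can be written as a short Python sketch. The confusion-matrix counts are hypothetical and the function names are our own.

```python
def recall(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    """Share of predicted positives that are truly positive: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts: 8 of 10 actual positives are found, with 5 false alarms.
tp, fp, fn = 8, 5, 2
print(recall(tp, fn), precision(tp, fp), f1_score(tp, fp, fn))
```

Because recall ignores false positives, it is best read alongside precision or the F1-score, which drops when a model reaches high recall only by flagging many negative cases.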

Figure 10. Training and testing sample by Matthews correlation coefficient.
This pattern held true for both sampling techniques employed in this study. On the other hand, LR and NB models performed consistently regardless of the model or sampling technique used.
Additional training and testing results summarised by Gini and recall measures.

Table 6. Training results by the Gini score.

Table 7. Training and testing results by recall.
[Table columns: Model, Sample 1, Sample 2, Sample 3, Sample 4, Sample 5, Sample 6, Sample 7, Sample 8, Sample 9]