Solving the Class Imbalance Problem Using a Counterfactual Method for Data Augmentation

Learning from class imbalanced datasets poses challenges for many machine learning algorithms. Many real-world domains are, by definition, class imbalanced by virtue of having a majority class that naturally has many more instances than its minority class (e.g. genuine bank transactions occur much more often than fraudulent ones). Many methods have been proposed to solve the class imbalance problem, among the most popular being oversampling techniques (such as SMOTE). These methods generate synthetic instances in the minority class, to balance the dataset, performing data augmentations that improve the performance of predictive machine learning (ML) models. In this paper we advance a novel data augmentation method (adapted from eXplainable AI), that generates synthetic, counterfactual instances in the minority class. Unlike other oversampling techniques, this method adaptively combines existing instances from the dataset, using actual feature-values rather than interpolating values between instances. Several experiments using four different classifiers and 25 datasets are reported, which show that this Counterfactual Augmentation method (CFA) generates useful synthetic data points in the minority class. The experiments also show that CFA is competitive with many other oversampling methods, many of which are variants of SMOTE. The basis for CFA's performance is discussed, along with the conditions under which it is likely to perform better or worse in future tests.


Introduction
Imbalanced datasets create significant problems for machine learning (ML) in classification tasks [47]. Classically, this problem arises in binary classification tasks when most of the data comes from one class (i.e., the majority class) and less comes from the other class (i.e., the minority class). For instance, in credit-card fraud-detection, datasets always have many more non-fraudulent instances than fraudulent ones, simply because the latter are rarer than the former [11]. When a given class is under-represented in the dataset in this way, a classifier's performance can be compromised in several ways; for instance, it may show poor accuracy in predicting the minority class, or spuriously high accuracy for the classifier as a whole (based only on its success with the majority class), and/or it can result in poor rule induction for decision trees [11,25,32,40,57]. This class-imbalance problem has been recognized in many real-world application domains such as medical diagnosis [66], fraud detection [11], text classification [78], and detection of oil spills in satellite radar images [42]. Notably, recently, some of the techniques proposed to solve the class imbalance problem have also proved useful in data augmentation for deep learning models, when new, synthetic data-points need to be generated to create the large, labelled datasets required for better performance [63,73,74].
In this literature, several important approaches have emerged to deal with this problem based on solutions at the data or algorithm levels. Data level solutions attempt to change the distribution of the imbalanced data by re-sampling the original data [9,10,11,12,27,30,31,46,58,59]; typically, these techniques either oversample the minority class or undersample the majority class or sample using some combination of both of these methods. Specifically, the Synthetic Minority Over-Sampling Technique (SMOTE) has become a very popular method for solving class-imbalance issues in traditional ML and has also been applied to the data augmentation problem in deep learners [11,31,46,63]. Algorithm level solutions aim to modify the machine learning algorithms used, to mitigate their bias towards majority groups; typically, these techniques involve the use of cost-sensitive and ensemble methods [20,24].
In this paper, we explore a novel approach to both the class imbalance and data augmentation problems using an instance-based counterfactual method that generates synthetic data-points in the minority class [39,64]; interestingly, this method was previously developed to solve problems in eXplainable AI (XAI; for reviews see [37,38]). In logic, Lewis [45] proposed that counterfactuals are the closest possible world to the current world in which the outcome is different. Hence, the intuition behind the current technique is that it generates "synthetic" counterfactual instances using the actual feature-values of instances (not interpolated values) that are close to existing instances, thus populating the minority class with plausible adaptations of existing data. If this intuition is correct, then the synthetic instances generated by this counterfactual method should improve ML performance, perhaps to a level that advances current class-imbalance techniques.

Related Work
The related work to the present research comes from two different strands of AI research: from (i) data-level sampling techniques for the class imbalance problem and (ii) counterfactual methods for eXplainable AI (XAI).
Data level solutions to the class imbalance problem are dominated by three main approaches: Random Over-Sampling (ROS), Random Under-Sampling (RUS), and the Synthetic Minority Over-Sampling Technique (SMOTE) [11,31,46]. In ROS, the class distribution is balanced by randomly adding multiple copies of some of the minority-class instances to the training data [46]. Whereas, with RUS, a certain number of examples of the majority class are randomly removed from the original dataset [31]. Although these methods can re-balance the original dataset, they have some drawbacks. Since ROS merely copies minority-class instances, no new information is added to the dataset and, hence, it can lead to overfitting [72]. On the other hand, since RUS randomly removes examples from the majority class, data can be discarded that may be important [53]. The third option, SMOTE, adopts a somewhat different approach based on oversampling from the minority class. As SMOTE is the baseline method used for comparisons in the present experiments, we briefly describe it in more detail here (see section 2.1) along with important SMOTE variants (see section 2.2) before going on to describe the counterfactual method we have adapted from the XAI literature (see section 2.3). Finally, we briefly sketch the recent and very small literature that has begun to apply these counterfactual XAI methods to class-imbalance and data augmentation problems (see section 2.4).

Data Sampling Methods for the Class Imbalance Problem: SMOTE
Synthetic Minority Oversampling Technique (SMOTE) oversamples the minority class by creating "synthetic" instances rather than by oversampling with replacement. It is one of the most widely used solutions to the class-imbalance problem; Google Scholar lists over 14,500 citations to the original paper [11,21]. In SMOTE, a new example in the minority class is created by interpolating between several minority-class instances. By interpolating instead of copying instances, SMOTE avoids the over-fitting problem, and creates new synthetic instances in neighborhoods surrounding instances in the minority class. Briefly, the algorithm works as follows. Assume that the minority class is S_min and the majority class is S_maj. SMOTE starts by randomly selecting a minority instance, x_i, from the minority class, S_min, and then determines the k nearest neighbors of x_i in S_min.
After determining the k nearest neighbors of x_i, it selects a random neighbor x_j, where x_j ∈ S_min. Finally, SMOTE creates a new instance x_new using the following formula:

x_new = x_i + λ × (x_j − x_i)

where λ is a random number between 0 and 1. This new instance is then added to the dataset for the minority class. One of the potential problems with SMOTE is that its generation of minority instances is done without reference to the majority class or, indeed, any consideration that some minority instances may be better than others to use in this data-generation process. Another issue is that it may introduce noise, by generating interpolated values that do not exist in the domain (e.g., the interpolated value could be out-of-distribution). Accordingly, many extensions have been made to SMOTE that improve on its operation. In the following sub-section, we review the SMOTE variants that are closest to the current method proposed, to reveal how it differs.
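In outline, this interpolation step can be sketched as follows (a minimal Python illustration of the idea, not a production SMOTE implementation; the function and variable names are ours):

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=1, seed=None):
    """Minimal SMOTE sketch: interpolate between a random minority
    instance and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x_i = X_min[i]
        # k nearest minority neighbours of x_i (index 0 is x_i itself)
        d = np.linalg.norm(X_min - x_i, axis=1)
        nn = np.argsort(d)[1:k + 1]
        x_j = X_min[rng.choice(nn)]
        lam = rng.random()                    # lambda in [0, 1)
        synthetic.append(x_i + lam * (x_j - x_i))
    return np.array(synthetic)

# Four minority points on the corners of a unit square; every synthetic
# point is a convex combination of two of them, so it stays in the square.
X_min = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
X_new = smote_sample(X_min, k=2, n_new=3, seed=0)
```

Because the new points are interpolations, they always lie between existing minority instances, which is precisely the source of the out-of-distribution concern raised above.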

SMOTE Variants: Three Key Insights
There are many variants of SMOTE that improve on the original's performance based on several insights about how to solve the class-imbalance problem: these variants often hinge on identifying important/safe regions in the minority class; they emphasise the importance of focusing on the border region between the majority and minority classes; and they sometimes analyze the majority class with respect to the minority class to guide under/oversampling.

SMOTE in Selected Regions. One critical improvement to the original SMOTE method hinges on the insight that not all regions in the minority class are equal; some may be more important or safer than others within which to apply SMOTE. For instance, k-Means SMOTE [19] clusters minority instances into k clusters and then oversamples from clusters with the most instances, the assumption being that these are safer regions and are less likely to generate noise (see AND-SMOTE [76] for a related solution). Other versions of this approach have used DBSCAN, a density-based clustering algorithm, to identify safe regions [10] or use representative points within clusters to guide SMOTE (e.g., CURE-SMOTE [49]). Some methods project the minority class into a lower dimension before applying SMOTE to the clusters found; for instance, SOMO [18] uses a self-organizing map to transform high-dimensional datasets into a two-dimensional space, and LLE-SMOTE [71] uses a locally-linear embedding algorithm to project into a lower dimension where the datasets are more separable. Still others, such as G-SMOTE and ADASYN, explore different ways to identify regions within which to generate minority instances. G-SMOTE [61] defines a geometric region around each minority class instance for generating synthetic datapoints.
ADASYN, proposed by He et al. [30], generates minority class instances according to their distributions, generating more synthetic data from minority instances that are harder to learn compared to minority instances that are easier to learn (where ease-of-learning is related to the number of instances in the k-nearest neighbors that belong to the majority class). However, these safe-region solutions owe a lot to another key insight, namely that regions close to the class boundary are particularly important for instance generation.
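ADASYN's weighting of "hard" minority instances can be sketched as follows (a minimal Python illustration of the idea, not the reference implementation; the function and variable names are ours):

```python
import numpy as np

def adasyn_weights(X_min, X_maj, k=5):
    """Sketch of ADASYN's sampling weights: minority instances whose
    k-nearest neighbourhoods contain more majority instances are
    'harder to learn' and receive proportionally more synthetics."""
    X_all = np.vstack([X_min, X_maj])
    is_maj = np.array([False] * len(X_min) + [True] * len(X_maj))
    r = []
    for x in X_min:
        d = np.linalg.norm(X_all - x, axis=1)
        nn = np.argsort(d)[1:k + 1]        # skip the instance itself
        r.append(is_maj[nn].mean())        # fraction of majority neighbours
    r = np.array(r)
    if r.sum() == 0:                       # no majority neighbours anywhere
        return np.full(len(X_min), 1.0 / len(X_min))
    return r / r.sum()                     # normalized sampling distribution

# A minority point surrounded by majority instances ([0, 0]) should get a
# larger weight than one sitting in a sparser region ([5, 5]).
w = adasyn_weights(np.array([[0., 0.], [5., 5.]]),
                   np.array([[0., 1.], [1., 0.], [0., -1.], [-1., 0.], [12., 12.]]),
                   k=3)
```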
SMOTE On the Borderline. The idea that different regions in the dataset need to be handled differently owes a lot to the intuition that minority instances close to the decision boundary are the ones most at risk of being misclassified. Borderline-SMOTE (B-SMOTE) [27] exploits this intuition by identifying the borderline minority instances, those whose nearest neighbors are mostly majority instances (the DANGER set). Then, for each instance in the DANGER set, the k-nearest neighbors from the minority class are found and the steps from the original SMOTE method are applied to them to generate synthetic instances in the minority class. This insight about the importance of the boundary regions has been exploited in different ways. For example, SVM-SMOTE [54] uses an SVM to approximate the decision boundary and then generates new synthetic data along the lines joining each minority-class support-vector with its nearest neighbors using interpolation or extrapolation techniques. In a similar vein, M-SMOTE [34] divides minority instances into three groups (security instances, border instances, and latent noise instances) and then treats these groups differently when generating instances. Other variants in this vein adjust the sampling rate for some minority instances (e.g., those close to the boundary) to improve these methods further (see GAS-SMOTE [36] and SMOTE-D [68]). This use of boundary regions in the minority class also raises questions about the relationship of majority instances to minority instances, leading to a third insight underlying SMOTE variants; namely, that oversampling in the minority class can be informed/guided by considering the majority class.
Using the Majority Class. A third important insight in this research area, which becomes more apparent when borderlines are explored, is the idea that the relationship between the majority class and the minority class can also help guide SMOTE. Earlier, we saw that Borderline-SMOTE does not interpolate instances when the k-nearest neighbors show a preponderance of majority instances (see also ADASYN). This is one way to take the majority class into account. Other methods explore the relationship between classes to undersample the majority class using data-cleaning techniques [5], or to guide the oversampling of the minority class [6,9]. For instance, SMOTE-Tomek [5] finds pairs of instances between the minority and majority classes that are very similar (low Euclidean distance), a so-called Tomek Link, and then removes the majority instance in the pair; by removing these Tomek-links and applying SMOTE to the minority class, it attempts to re-balance the classes (SMOTE-ENN [5] uses a related data-cleaning approach involving the Edited Nearest Neighbour method). Other methods, SL-SMOTE [9] and SWIM [6], perform explicit analyses of the majority class and use this analysis to inform/guide minority instance generation. Safe-Level-SMOTE (SL-SMOTE) [9] does this by computing a safe-level score for each minority instance (where safety is based on the frequency of majority instances in the k-nearest neighbours) and a safe-level ratio based on the safe-level score of a minority instance over that of its neighbours. SL-SMOTE's finer analysis of the relationship between majority and minority classes has been shown to improve performance over B-SMOTE. Sampling WIth the Majority (SWIM) [6] adopts a different approach, leveraging information about the density of well-represented majority instances (using Mahalanobis distances) and requiring generated minority instances to have similar distances to their minority seeds. So, SWIM essentially analyses the topology of the majority class to guide the generation
of minority instances (see MC-RBO [41] and LN-SMOTE [50] for related approaches). Finally, SMOTE-RSB [59] is another method that takes similarities to majority instances into account, computing Rough Sets over the minority class after SMOTE has been applied to generate additional minority instances; this method acts like a data-cleaning step to remove generated instances that might be noise. Many of these methods improve on B-SMOTE's performance and, as such, show that paying more attention to the majority class can play a key role in informing/guiding instance generation in the minority class. We will see later that while the current counterfactual method reflects these three key insights about how to improve on SMOTE, it is quite different from all of the above methods in how it operates (see section 3.3). But, before considering this counterfactual method in detail, we first briefly review how it has emerged in XAI.
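As an illustration of one of these majority-aware techniques, Tomek-link detection (as used by SMOTE-Tomek) can be sketched as follows (a minimal Python sketch, not a reference implementation; the function name is ours):

```python
import numpy as np

def tomek_links(X, y):
    """Sketch of Tomek-link detection: a pair of instances from opposite
    classes where each is the other's nearest neighbour. SMOTE-Tomek
    removes the majority member of each such pair after oversampling."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # an instance is not its own neighbour
    nn = d.argmin(axis=1)                  # nearest neighbour of each instance
    links = []
    for i in range(len(X)):
        j = nn[i]
        if nn[j] == i and y[i] != y[j] and i < j:
            links.append((i, j))           # mutual nearest neighbours, opposite classes
    return links

# Instances 0 and 1 are mutual nearest neighbours in opposite classes,
# so they form the only Tomek link; 2 and 3 share a class.
X = np.array([[0., 0.], [0.1, 0.], [5., 5.], [6., 6.]])
y = np.array([0, 1, 0, 0])
links = tomek_links(X, y)
```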

Counterfactual Generation in XAI
In this paper, we deploy a case-based counterfactual method to generate synthetic, minority-class instances [39,64]. Counterfactual methods have been developed to generate post-hoc examples to explain the predictions of black-box ML models and to provide algorithmic recourse for end-users trying to mitigate automated decisions (for reviews see [37,38]). The classic counterfactual explanation is one that is given when an automated system refuses a person on a loan application [70]; when the end-user asks "why?", the system might counterfactually explain that "If you requested a loan for $500 less over a shorter term, then you would have been granted the loan". That is, the counterfactual explanation tells users about the conditions under which the outcome would change, the closest world to their world in which the outcome would be what they desire.
Counterfactuals have been researched for some time in AI under diverse names; for instance, in the past, they have been called Nearest Unlike Neighbours (NUNs [16,51,55]) or inverse classifications [1,43]. Recently, they have emerged as a hot topic in XAI because they appear to have psychological benefits (i.e., people naturally understand them) and legal benefits (i.e., they are said to be GDPR compliant). Optimization techniques are currently the most popular method for computing counterfactuals [14,52,62,70]. Given a test instance (e.g., one encoding the original loan refusal), these optimization methods search a (sometimes randomly) generated space of perturbations of the query (i.e., synthetic instances) under a loss function that balances proximity to the test-instance against proximity to the decision boundary for the counterfactual class (i.e., the class that counters that of the query), using a scaled L1-norm distance-metric. Wachter et al.'s [70] seminal method uses gradient descent to find the best counterfactual instance for a query, though later models have used other techniques (e.g., genetic algorithms). Mothilal et al. [52] proposed the Diverse Counterfactual Explanations (DiCE) method as an extension, to generate a set of diverse counterfactual candidates, avoiding the problem of generating sets of candidates that were trivial variants on one another. The main problem with these optimization methods is that, given their "blind" perturbation of test-items, they sometimes generate out-of-distribution, invalid data-points [17,44,70]. This defect has potentially serious side-effects for their use in the class-imbalance problem, as it suggests that they might populate the minority class with noise, with consequential negative effects on a classifier's performance.
However, a very different case-based approach to counterfactual generation has recently been proposed [39,64]. This instance-guided method finds the test-instance's nearest neighbor that takes part in a so-called explanation case (xc). An explanation case captures a counterfactual relation between existing instances in the dataset that lie in opposing classes on either side of a decision boundary, with the constraint that the pair of instances differ by at most two feature-differences.
For example, the loan dataset could contain two existing cases that are counterfactually related; one about a "30-year old, female accountant earning $35k who was refused a $10k loan" that is counterfactually related to another instance with a different outcome, namely a "30-year old, female accountant earning $40k who was granted a $10k loan" (differences shown in italics). This explanatory case implicitly suggests that "IF one earns $40k rather than $35k THEN the loan decision is likely to be granted rather than refused". So, if I am a "30-year old, male teacher earning $35k who was refused a $10k loan", then this algorithm could find this explanatory pair as a nearest neighbour and suggest that if this male teacher earned $5k more ($40k rather than $35k) then the $10k loan would be granted. In the XAI context, this method has been shown to generate close, plausible counterfactuals and appears to avoid the out-of-distribution pitfalls that arise in optimization techniques (see [17,39,64]). From a data augmentation perspective, this method can be seen as supporting the creation of synthetic data-points in the minority class, using information from these known counterfactual pairs. However, few XAI techniques have been applied to data augmentation problems. In the next subsection, we briefly sketch this small literature on the topic.
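The reuse of a known explanation case can be sketched as follows (an illustrative Python sketch of the idea; the dictionary keys and feature values are hypothetical, echoing the loan example above):

```python
def build_counterfactual(query, pair, diff_features):
    """Sketch of the case-based method: transfer the difference-features
    of a known counterfactual pair onto the query, keeping the query's
    own values for the matching features."""
    _, cf_instance = pair                  # the instance with the desired outcome
    new = dict(query)
    for f in diff_features:
        new[f] = cf_instance[f]            # copy only the difference-features
    return new

# The loan example from the text: the existing pair differs only on salary.
refused = {"age": 30, "job": "accountant", "salary_k": 35, "loan_k": 10}
granted = {"age": 30, "job": "accountant", "salary_k": 40, "loan_k": 10}
query = {"age": 30, "job": "teacher", "salary_k": 35, "loan_k": 10}

cf = build_counterfactual(query, (refused, granted), diff_features=["salary_k"])
```

The generated counterfactual keeps the teacher's own matching features and suggests only the salary change that is known, from the explanation case, to flip the outcome.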

Using Counterfactuals for Data Augmentation
Beyond XAI, our hypothesis is that counterfactual methods can also play a role in data augmentation to solve class-imbalance problems; that is, that generated synthetic counterfactual cases could improve the predictive accuracy of AI models. Although there are now hundreds of papers on counterfactuals in XAI, only a handful of papers consider their use in data augmentation [29,56,65,77].
In evaluating XAI counterfactual methods, Mothilal et al. [52] suggested that a good method should generate a set of counterfactuals that can substitute for the original dataset (calling it substitutability); that is, if the set of generated counterfactuals were plausible and close to the original data, then their predictive performance should parallel that of the original dataset. However, Mothilal et al. did not consider using their counterfactual method for data augmentation. In a student project, Hasan [29] did, attempting to determine whether an augmented dataset based on generated counterfactuals could act as a proxy dataset, but found only modest success.
A selection of other papers in diverse areas have also circled the issue of using counterfactual techniques for data augmentation. Subbaswamy and Saria [65] considered the problem of dataset shift, where there is a divergence between the contexts in which a model was trained and tested; they use the notion of "counterfactual risk" to diagnose this problem using causal models. Zeng et al. [77] proposed the Counterfactual Generator, which generates counterfactual examples for textual data, and found that generated counterfactuals improved the generalizability of models given limited observational examples. Pitis et al. [56] proposed Counterfactual Data Augmentation (CoDA) for generating counterfactual experiences in reinforcement learning (RL), in which the method increases the size of the available training data with counterfactual examples by stitching together locally-independent subsamples from the environment. They found that CoDA significantly improved the performance of RL agents in locally-factored tasks for batch-constrained and goal-conditioned settings. The problem with these papers is that they use bespoke counterfactual methods developed for specific task domains, rather than the tried and tested techniques from the XAI literature. Therefore, their performance, robustness and generalizability are, at best, uncertain.
However, no study has yet applied a well-tested XAI counterfactual method to the problem of data augmentation. Hence, this is the aim of the remainder of this paper.

The Counterfactual Augmentation Algorithm (CFA)
This paper advances the use of counterfactual methods for data augmentation, as a solution that populates the minority class with more synthetic data to solve class-imbalance problems. The application of this XAI method to data augmentation was motivated by the observation that it seemed to generate plausible, synthetic datapoints for explanatory purposes; furthermore, the evaluation metrics in XAI showed that these datapoints were generally valid, within-distribution and close to existing datapoints. Accordingly, the extension of these techniques to data augmentation problems seemed like a promising avenue of research.
In this section, we describe a new oversampling method using a case-based reasoning approach to generating synthetic counterfactuals in the minority class, to be applied to binary classification problems. Consider a simple scenario to show how this method operates.

A Counterfactual Example for Class Imbalances
Imagine an ML classifier being used to predict whether farm animals are likely to be healthy or fall ill (e.g., mastitis in cows; see [60]). The dataset recording a herd of cows on most farms will be imbalanced, in that most cows will tend to be healthy rather than ill. An analysis of this dataset shows that some pairs of instances, majority-minority instance pairs, can be counterfactually related to one another; for example, a majority instance, Cow-A, of a certain breed, age, milk-yield and health-history that is classed as healthy can be counterfactually paired with a minority instance, Cow-B, that is of the same breed, age and milk-yield but with a different health-history (e.g., it has been ill several times) that is classed as (likely to be) ill. This counterfactual pair (which we call a native counterfactual) tells us that a one-feature difference in the health-history feature can change the class of a cow from healthy to ill. So, if we want to fix the class imbalance in this dataset using our counterfactual method, then we can generate a new minority instance by re-using this known counterfactual relationship. Imagine we pick another majority instance, Cow-A' (a disease-free cow that has no existing counterfactual pair), and find a nearest neighbour to Cow-A' that is part of a known counterfactual pair (e.g., the Cow-A~Cow-B pair). Using this majority instance and the native counterfactual pair, we can generate a new, synthetic minority instance, using the match-features of Cow-A' and the difference-features from Cow-B; so, this new instance, Cow-B', would have the matching-features of Cow-A' (for breed, age and milk-yield) and the difference-feature from Cow-B (for health-history), along with the prediction that it will be ill. So, we have now created a new minority instance, Cow-B', that is counterfactually related to Cow-A' (n.b., the class of this new instance needs to be verified by the underlying ML model). This example describes the generation of one minority instance. In our experiments, we do this iteratively for all those majority instances that are not paired in native counterfactuals; a step that results in the generation of many more minority instances close to the decision boundary with the majority class.
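This generation loop can be sketched in code as follows (an illustrative Python sketch, not the authors' implementation; the numeric features are made up, here [age, milk_yield, illness_count], and all names are ours):

```python
import numpy as np

def cfa_augment(unpaired, pairs, diff_idx_per_pair):
    """Sketch of CFA's generation loop (numeric features only, assumed
    encoding): for each unpaired majority instance, find the nearest
    paired majority instance, then copy that pair's difference-features
    from its minority member onto the unpaired instance."""
    paired_majority = np.array([p for p, _ in pairs])
    synthetic = []
    for x in unpaired:
        d = np.linalg.norm(paired_majority - x, axis=1)
        k = d.argmin()                      # nearest native counterfactual pair
        _, cf = pairs[k]
        new = x.copy()
        for f in diff_idx_per_pair[k]:
            new[f] = cf[f]                  # transfer the difference-features
        synthetic.append(new)
    return np.array(synthetic)

# One native pair (Cow-A, Cow-B) differing on feature 2 (illness_count);
# Cow-A' is an unpaired healthy cow, and the generated Cow-B' keeps its
# match-features but takes Cow-B's illness_count.
pairs = [(np.array([4., 28., 0.]), np.array([4., 28., 3.]))]
new_minority = cfa_augment(np.array([[5., 22., 0.]]), pairs, [[2]])
```

As the text notes, the class of each generated instance still needs to be verified by the underlying ML model before it is added to the minority class.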
In the next sub-section, we describe the algorithm more formally.

The Method: Counterfactual Augmentation (CFA)
The Counterfactual Augmentation (CFA) method is a technique for generating synthetic examples in the minority class using counterfactual methods (see [39,64] for the method used in XAI). The CFA method generates synthetic counterfactual instances in three main steps (see Figure 1). The procedure for CFA is as follows. Step 1: Compute the CF-Set for the dataset, D. CFA first finds all possible "good" native counterfactual pairs between instances that already exist in the dataset, D; these native counterfactuals pair an instance in the majority space (called the paired instance) with its counterfactually-related instance in the minority space (called the counterfactual instance). In other words, for every instance x in the majority class, we find its counterfactual x′ in the minority class. These native counterfactual pairs straddle the decision boundary (they are called native because, in one sense, they already exist in the dataset). Each of these native pairs has a set of match-features and a set of difference-features, where the differences determine the class change over the decision boundary.
Step 2: Retrieve a native counterfactual pair for each unpaired majority instance; that is, for each majority instance that does not take part in any native counterfactual, find its nearest neighbor among the paired instances of the CF-Set. Step 3: Generate a new synthetic minority instance by combining the match-features of the unpaired instance with the difference-features of the counterfactual instance in the retrieved pair (as in the worked example above), with the class of the new instance being verified by the underlying ML model. It should be noted that tolerance is a parameter of the CFA algorithm, used to improve the availability of good native counterfactuals in the dataset. Without tolerance, fewer counterfactuals would be found and their generative benefits would likely diminish. In finding matching- and difference-features between two instances for a native counterfactual, CFA computes a tolerance by finding the mean (μ) and standard deviation (σ) of each feature. It then allows two feature-values to match if they are within +/-10% of the standard deviation of all the values for that feature.
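The tolerance check can be sketched as follows (one plausible reading of the matching rule, in Python; the exact rule in the original implementation may differ, and all names are ours):

```python
import numpy as np

def match_mask(x_a, x_b, X, tol_frac=0.10):
    """Sketch of CFA's tolerant feature matching: two feature values
    'match' if they differ by at most tol_frac (10%) of that feature's
    standard deviation across the dataset X."""
    sigma = X.std(axis=0)
    return np.abs(x_a - x_b) <= tol_frac * sigma

def is_good_pair(x_maj, x_min, X, max_diffs=2, tol_frac=0.10):
    """A 'good' native counterfactual pair has at most two
    difference-features (the non-matching features)."""
    diffs = ~match_mask(x_maj, x_min, X, tol_frac)
    return diffs.sum() <= max_diffs

# Feature 0 differs by 0.3 (within 10% of its std), feature 1 by 5 (not),
# so the pair has one difference-feature and counts as 'good'.
X = np.array([[0., 0.], [10., 10.], [5., 5.], [2., 8.]])
m = match_mask(np.array([0., 0.]), np.array([0.3, 5.0]), X)
```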
Two subtle differences distinguish this data augmentation version of the algorithm from its XAI counterpart. First, although both algorithms adopt the same definition of a "good" counterfactual pairing, they do so for different reasons. On psychological grounds, Keane and Smyth [39] defined a "good" counterfactual to be one with no more than two feature-differences. From the XAI perspective, researchers argue that "sparse" counterfactuals (with fewer feature differences) are better, because people find them more understandable (confirmed by user studies [22,23]).
From a data augmentation perspective, basing synthetic counterfactuals on sparse pairs also makes sense, because the implicit causal dependencies between matched and difference features are more likely to be preserved in sparse pairs; hence, synthetic data-points generated using these sparse pairs should be more likely to be valid and within-distribution. Second, there is a critical difference between the XAI and data augmentation contexts with respect to the selection of test instances.
In XAI, the test instance is typically a novel problem for which a classifier has made a prediction, a prediction that needs to be counterfactually explained. Hence, typically, the test instance is not already in the training data. In data augmentation, the test instances used to generate synthetic counterfactual data have to be in the training dataset; specifically, they are all the majority-class instances in the training data that do not take part in native counterfactuals; this is why these "test instances" are called unpaired instances. For data augmentation purposes, the "test instances" are a residual set of majority instances (left after the native counterfactual pairs have been identified).

How Counterfactuals Differ from SMOTE Variants
It should be apparent that this instance-based counterfactual method is quite different from SMOTE and its variants, though it is consistent with many insights from the class-imbalance literature.
First, by definition, the counterfactual method addresses regions close to the decision boundary; a good counterfactual records the minimal feature-changes that result in a class change (as in Borderline-SMOTE and SVM-SMOTE). Second, this instance-based method relies on native counterfactuals in the dataset, pairings between existing majority and minority instances, and, as such, it exploits relationships between both classes (e.g., as in ADASYN, SMOTE-RSB, SL-SMOTE).
Third, we are highly selective in the minority instances used to generate synthetic instances (as in the many clustering-driven SMOTE variants); that is, we only use those involved in known counterfactuals with at most two feature-differences. However, this counterfactual method is quite different in many other significant respects: (i) it does not use interpolation between majority/minority instances but rather uses the native counterfactual as a template for generating new minority instances; (ii) it does not rely on the topology of the majority class (e.g., as in SWIM), but acts in a very local way using the counterfactual relation between a single majority instance and a single minority instance; (iii) it does not rely on any clustering analysis of the majority and minority classes. As such, it represents quite a novel departure relative to existing SMOTE variants.

Competitive Tests of Data Augmentation Methods
In the current study, we competitively test the instance-based counterfactual method (CFA) against the benchmark techniques in the class-imbalance literature using six oversampling methods: SMOTE [11], B-SMOTE [27], ADASYN [30], SL-SMOTE [9], SVM-SMOTE [54], and SMOTE-RSB [59]. These specific methods were chosen based on their conceptual closeness to the CFA method, their popularity amongst SMOTE variants, and their public availability as implementations. The techniques were tested on a representative selection of 10 commonly-used UCI/KEEL datasets [2,3,4,8], from which 25 dataset-variants were produced, with four different ML classifiers: Random Forest (RF), k-nearest neighbor (k-NN), Logistic Regression (LR), and Multilayer Perceptron (MLP) models. Several alternative ML models were used because different models find different decision boundaries for a given dataset, differences that could impact the success of the counterfactual method (as it relies heavily on a model's decision boundary). The Baseline-Control for a given classifier recorded the performance of the model on a given dataset without any data augmentation applied. Several standard measures were used to assess the performance of the methods; namely, Precision, Recall, F1, and plots of ROC curves.

Method: Datasets & Setup
Table 2 shows the main characteristics of the datasets drawn from both UCI and KEEL repositories.
As the focus is on binary classification problems and some of these datasets are multi-class, they were converted to binary classes. The One-Versus-One (OVO or 1v1) [7,33] and One-Versus-Rest (OVR or 1vR) [7,13,69] methods were used to perform this conversion. The OVO method splits a multi-class classification into one binary classification dataset for each pair of classes, whereas the OVR approach selects one of the multiple classes and predicts it against all other classes, so that one class is treated as the positive (minority) class and all other classes are treated as the negative (majority) class. In this paper, the datasets were modified using both methods (one method per dataset) to vary the class-imbalance ratio among the datasets (see Table 2). Each classifier was also run on the original imbalanced dataset as a baseline, without using any data augmentation method. Finally, to determine the optimal values for all classifiers, we applied hyperparameter tuning using the GridSearchCV function from scikit-learn. GridSearchCV performs a cross-validated grid search across all hyperparameter combinations and finds the best score for a given classifier. To achieve this, we defined our grid of parameters for each classification method (RF, k-NN, LR, MLP) and oversampling method (SMOTE & its variants) and then ran the grid search (see Table 1 for a full description).
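The grid-search step can be sketched as follows (the parameter grid shown is illustrative, not the exact grid from Table 1):

```python
# Sketch of the hyperparameter tuning described above, using scikit-learn's
# GridSearchCV. The grid and toy dataset are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)  # cross-validated search over all 4 combinations
print(search.best_params_, round(search.best_score_, 3))
```

`GridSearchCV` fits every combination in the grid with 5-fold cross-validation and retains the combination with the best mean AUC-ROC.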

Metrics & Measures
In binary classification problems, the labels can be either positive or negative, so the prediction made by the classifier is represented as a 2 × 2 confusion matrix [35] (see Table 3). The confusion matrix summarizes the performance of classifiers for the four possible outcomes of a given classification: true positive (TP), true negative (TN), false positive (FP) and false negative (FN). Accuracy was not used as a measure because, as discussed earlier, it can be spuriously high for imbalanced datasets. It should be noted that all datasets used in our experiments were converted to binary datasets using two of the most well-known strategies, 1v1 [7,33] and 1vR [7,13,69] (see section 4.1 for a full description). Hence, the evaluation metrics used were precision, recall, F1 and AUC; AUC scores were reported as they measure the two-dimensional area that lies under the ROC curve.
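The standard confusion-matrix definitions of these metrics are:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```

Precision measures how many predicted minority instances are truly minority; recall measures how many true minority instances are found; F1 is their harmonic mean.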

Results & Discussion
Overall, the counterfactual data-augmentation method (CFA) performs better than all other SMOTE-based methods on the main metrics reported, for most of the classifiers tested (see Tables 4-7). Recall, the cross-validated datasets were tested on four classifiers (RF, k-NN, LR, MLP) with the Baseline (no data augmentation), SMOTE, B-SMOTE, ADASYN, SVM-SMOTE, SL-SMOTE, SMOTE-RSB and CFA. Tables 4-7 report the main metric (AUC-ROC) for each classifier on the datasets. For the RF classifier (Table 4), the results show that CFA does better than all the other SMOTE-based methods on 21 datasets out of 25, with SVM-SMOTE being the next best on only 2 datasets; ADASYN and SL-SMOTE each had the highest AUC-ROC on one dataset. For the k-NN classifier (Table 5), CFA also achieved a greater improvement in AUC-ROC: it does better than all the other SMOTE-based methods on 19 out of 25 datasets, with ADASYN the next best with 2 datasets, while SMOTE, B-SMOTE, SVM-SMOTE and SMOTE-RSB each had the highest AUC-ROC on 1 dataset. For the LR classifier (Table 6), the results are quite different: CFA does better on only 10 out of 25 datasets, with SMOTE-RSB the next best with 5 datasets, while the Baseline, B-SMOTE, ADASYN, SVM-SMOTE and SL-SMOTE each do better on 2 datasets. Finally, for the MLP classifier (Table 7), the results again showed that CFA does better than all the other SMOTE-based methods, on 16 out of 25 datasets, whereas SMOTE-RSB does better on only 5 datasets, with SVM-SMOTE the next best with 3 datasets; the Baseline, B-SMOTE, and ADASYN each had the highest AUC-ROC on 2 datasets. Notably, these results show that for certain datasets and classifiers SMOTE-RSB does quite well; however, when data augmentation can make a contribution, it seems to be CFA that contributes the most to performance improvements. Finally, the ROC curves, which show the trade-off between sensitivity and specificity, are presented in Figures 3 and 4. Figure 3 shows selected examples of ROC curves where CFA outperformed SMOTE-based methods, obtained for the different methods using the four classifiers on different datasets. According to Figure 3, CFA clearly outperformed the other data augmentation methods (i.e., SMOTE, B-SMOTE, ADASYN, SVM-SMOTE, SL-SMOTE, SMOTE-RSB). For example, when running an RF classifier on the 'Abalone-9-vs-13' dataset, CFA performed better than the SMOTE-based methods with respect to the ROC curve (Figure 3). These results support our earlier findings that CFA can work as a successful data augmentation technique for handling the class-imbalance problem. On the other hand, Figure 4 shows several examples of ROC curves where the SMOTE variants (i.e., SMOTE, B-SMOTE, ADASYN, SVM-SMOTE, SL-SMOTE, SMOTE-RSB) do better than CFA; for example, when running an LR classifier on the 'PIMA' dataset, the SMOTE-based methods performed better than CFA (Figure 4).

Why Does CFA Work?
In XAI, counterfactual methods have been found to create plausible, synthetic datapoints for explanatory purposes; indeed, the evaluative metrics in XAI show that these explanatory counterfactuals are generally valid, within-distribution and close to existing datapoints. This experience in XAI is the backdrop and motivation for applying this counterfactual method to data augmentation. As we saw earlier, initial tests on a crop-growth prediction problem showed that generating counterfactuals in the minority class improved performance, specifically in dealing with the dataset drift caused by climate change (see [67]). This experience led to the present tests to determine the generality of these effects. As we can see, the CFA method seems to work well across a wide range of datasets, ML models and imbalance ratios. But why does it work so well?
From the climate example, our initial explanation was that CFA does well because it generates minority instances that are "counterfactual offsets" from known minority instances. But this account does not explain why the offsets tend to be useful. Our best account hinges on ideas about how case-based reasoning (CBR) systems operate. In CBR, target problems are solved by retrieving similar cases and (sometimes) adapting them to generate predictions. So, if I am trying to predict house prices in a city and my CBR system is presented with a "3-bed apartment with 4 bathrooms" and the closest retrieved case is a "3-bed apartment with 2 bathrooms", the system could have an adaptation rule that bridges the gap between the historical case and the target case; for instance, there may be an adaptation rule that says "In general, an additional bathroom is worth $5k more". So, in a typical CBR system, this rule would be applied to the retrieved case to bring it closer to the target case and improve the prediction. In CBR, adaptation rules were often hand-crafted, but they may also be learned from analyses of feature-difference patterns between instances in the case-base [15,28,75]; so, the extra-bathroom rule could be learned from bathroom-number differences found between historical instances (showing that they often lead to a $5k uplift in price, other features being equal). So, in CBR, these adaptation rules help to bridge holes in the case-base/dataset by providing plausible transformations of known datapoints.
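The bathroom example can be sketched as a toy adaptation step (case values and the $5k rule are illustrative only, not from any real CBR system):

```python
# Toy CBR adaptation sketch for the house-price example above.
# Case values and the $5k-per-bathroom rule are illustrative only.
retrieved = {"beds": 3, "baths": 2, "price": 300_000}  # closest historical case
target = {"beds": 3, "baths": 4}                       # query to be priced

# Learned adaptation rule: each extra bathroom adds ~$5k, other features equal.
BATH_UPLIFT = 5_000
adapted_price = retrieved["price"] + BATH_UPLIFT * (target["baths"] - retrieved["baths"])
print(adapted_price)  # 310000
```

The adaptation transforms a known datapoint into a plausible prediction for a "hole" in the case-base, which is exactly the role the counterfactual transformations play in CFA.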
Counterfactuals are a special case of adaptation rule; they capture the key feature-differences that lead to class changes across the boundary between majority and minority instances. So, when we apply them to majority instances to create synthetic minority instances, they stand a good chance of being plausible, as they are based on prior local transformations. Though they lack generality (they are not created from multiple pairings of the same case-differences, as learned adaptation rules are), they may work precisely because they are so constrained and local. Remember, CFA only considers native counterfactuals with <= 2 feature differences, so the relationship is highly constrained and specific to instances that are already very similar (i.e., all other features are essentially identical). To put it simply, CFA delivers good counterfactual rules that work locally to generate plausible datapoints that are predictively useful. So, it looks like the highly-constrained nature of the 2-feature-difference limit is important to the success of the method.

What Are CFA's Limitations?
The flip-side to the success question is the failure one: namely, when do we think CFA will fail and what limitations might it encounter? The current experiments use a version of CFA that performs well, so it is not immediately clear what would lead CFA to fail. However, there are several conditions under which CFA is likely to be less effective, with respect to (i) the quality of dataset differences, (ii) the use of the 2-feature-difference constraint and (iii) the use of different tolerances.
Quality of the Dataset. Fundamentally, CFA depends on the set of native counterfactuals in the dataset for its success. To put this another way, there needs to be a rich and diverse set of good counterfactual pairings between majority and minority instances on either side of a clear decision boundary. Without these counterfactuals, the ability to generate synthetic datapoints will be severely hampered. Current indications are that at least 5% of the majority class needs to be involved in these native counterfactuals, in order to provide a basis for generating minority instances from the other 95% of the majority class. However, we have not systematically tested how changes in this percentage affect performance. What we do know is that for many datasets the current parameters for CFA deliver good performance, so this factor might be quite robust to disruption.
The 2-feature-difference Constraint. A key hyperparameter for CFA is the constraint that the native counterfactuals built from the dataset involve no more than two feature-differences. This is a strong constraint that was made originally on psychological grounds in XAI [39]; that is, it has been shown that people prefer sparse counterfactuals with 2-3 feature differences, as they are easier to comprehend [22,23]. However, this rationale from XAI does not apply to data augmentation.
In data augmentation, the 2-feature-difference constraint may work because it produces minimally-different counterfactual pairs; that is, it produces very simple adaptation rules in which most features remain the same between instances and a small number of features differ. These difference patterns may be more representative of valid instance-differences in the dataset and, hence, be more likely to produce good synthetic datapoints.
There is some evidence to support this proposition in prior work. Temraz et al. [67] report that in pilot runs of their experiments they explored using 3-, 4- and 5-feature-difference counterfactuals but found they did not significantly improve predictive performance; that is, they were less likely to generate useful minority instances. We do not know whether similar results would be found for other datasets, though the fact that the 2-feature-difference constraint works here for 25 datasets with 9-12 features suggests that it works quite generally. So, again, we would expect CFA to fail if higher numbers of feature-differences were used in computing the native counterfactuals when applying the method.
The Importance of Tolerance. A final key parameter in CFA is tolerance. In finding matching- and difference-features between two instances for a native counterfactual, we apply a tolerance to the feature values. Specifically, we allow features to match if their values are within +/-10% of the standard deviation of all the values for that feature. This tolerance was applied uniformly across all of our datasets. Keane & Smyth [39] used a more sophisticated tolerance scheme that tailored the tolerance to each dataset; they varied the tolerances for each feature until changes in classification of the original dataset arose, and then chose a relative tolerance that produced no classification change. Obviously, without tolerance, fewer counterfactuals would be found and their generative benefits would likely diminish.
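This tolerance test can be sketched as follows (a minimal reading of the scheme, with made-up feature values; the function name is illustrative):

```python
# Sketch of the tolerance test described above: two feature values "match"
# if they differ by no more than 10% of that feature's standard deviation.
# Our reading of the scheme; names and data are illustrative only.
import numpy as np

def features_match(a, b, feature_stds, tol=0.1):
    """Boolean mask of which features of instances a and b count as matching."""
    return np.abs(a - b) <= tol * feature_stds

X = np.array([[1.0, 10.0],
              [1.02, 14.0],
              [0.5, 12.0]])
stds = X.std(axis=0)                   # per-feature standard deviations
mask = features_match(X[0], X[1], stds)
print(mask)                            # feature 0 matches, feature 1 differs
```

With this mask, a pair of instances from opposite classes forms a native counterfactual when the number of non-matching features is at most two.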
The Importance of the Decision Boundary. Fundamentally, like Borderline-SMOTE, CFA works with instances that are close to the decision boundary. So, clearly, the definition and nature of that decision boundary is critical to its successful performance. If the instances around the boundary are noisy and the boundary is less clearly defined, then CFA's performance is likely to degrade.
In this respect, it is interesting to note that CFA does best using the k-NN and Random Forest models relative to the MLP and Logistic Regression models, with the latter doing the worst (on AUC).

Conclusion
In this paper a novel oversampling method -- Counterfactual Augmentation (CFA) -- was proposed to handle the class-imbalance problem for binary classification tasks. CFA uses a case-based reasoning approach to generating synthetic counterfactuals in the minority class. The essence of this method is that it oversamples by adaptively combining actual feature-values from dataset instances, rather than extrapolating/interpolating values between instances. The key discoveries made are: (i) counterfactual methods developed for XAI can be usefully deployed to augment datasets with synthetic cases in the minority class that improve the performance of ML models; (ii) this method can successfully introduce new synthetic minority examples by leveraging known counterfactuals in the dataset; and (iii) this method can outperform many key benchmark SMOTE variants on a wide range of datasets with differing imbalance ratios, using representative ML models.

Figure 1 :
Figure 1: Counterfactual Augmentation (CFA): An unpaired instance, x′ (grey circle), finds a nearest neighbor, x (blue circle), taking part in a "good" native counterfactual-pair in the dataset, (x, c) (pairing of blue circle and yellow box), and then uses the difference-features of the counterfactual-instance, c (yellow box), to generate a new synthetic counterfactual-instance, c′ (green box), combining them with the matching-features of the original unpaired instance, x′ (grey circle). The generated synthetic instance, c′ (green box), is then added to the dataset to improve future prediction.

Steps 2 & 3
Step 2 -- For each unpaired instance, x′, from the majority class, find its nearest-neighbor paired instance, x, taking part in a native counterfactual, (x, c): For each unpaired instance, x′, CFA uses a k-NN to find its nearest neighbor, x, a paired instance involved in a native counterfactual pair, (x, c). By definition, x′ belongs to the majority class and does not occur in any native counterfactual pair; notably, this means that all the synthetic datapoints generated by CFA come from those instances in the majority class that are not already counterfactually-related to instances in the minority class. Euclidean distance is used in finding these nearest neighbors: ED(x, x′) = sqrt( Σᵢ (xᵢ − x′ᵢ)² ).
Step 3 -- Transfer feature-values from c to c′ and from x′ to c′: Having identified a candidate native counterfactual, (x, c), for x′, CFA generates a synthetic counterfactual instance in the minority class, c′, using feature-values from x′ and c, such that: § For each of the difference-features between x and c, take the values from c into the synthetic counterfactual case, c′. § For each of the match-features between x and c, take the values from x′ into the new counterfactual case, c′.
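The generation step described above can be sketched as follows (a minimal NumPy illustration with made-up feature values, not the authors' implementation; exact equality stands in for the tolerance-based feature match):

```python
# Minimal sketch of CFA's generation step (illustrative only). Given a
# native counterfactual pair (x, c) and an unpaired majority instance
# x_prime, the synthetic minority instance takes its difference-feature
# values from c and its matching-feature values from x_prime.
import numpy as np

def generate_synthetic(x, c, x_prime):
    diff = x != c                       # difference-features of the pair (x, c)
    return np.where(diff, c, x_prime)   # c's values where they differ, else x_prime's

x       = np.array([1.0, 5.0, 2.0, 7.0])  # majority member of the pair
c       = np.array([1.0, 5.0, 3.5, 9.0])  # its minority counterfactual (2 diffs)
x_prime = np.array([1.2, 4.8, 2.1, 7.1])  # unpaired majority instance
print(generate_synthetic(x, c, x_prime))  # [1.2 4.8 3.5 9. ]
```

Because the pair (x, c) differs in at most two features, the synthetic instance is a small, local offset of x_prime across the class boundary.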
The 10 base datasets were:
§ Abalone dataset: A multi-class dataset analyzed to find the age of abalone from physical measurements, consisting of 28 classes, that was modified using 1v1 and 1vR.
§ Glass dataset: A multi-class dataset used to classify glass-types based on chemical analysis, consisting of 7 classes, that was modified using 1vR so that class '3' is treated as the minority class and all other classes are treated as the negative class.
§ Yeast dataset: A multi-class dataset used to predict the cellular localization sites of proteins, consisting of 10 classes, that was modified using 1v1 and 1vR.
§ Pima Indians Diabetes dataset: A binary-class dataset used to predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
§ Phoneme dataset: A binary-class dataset used to distinguish between nasal and oral sounds.
§ Vehicle dataset: A multi-class dataset, with 4 classes. The problem is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. Again, this dataset was modified using 1vR, so that the class "Van" is treated as the minority class and all other classes are treated as the negative class.
§ Ecoli dataset: A multi-class dataset, with 8 classes. The problem is to classify Ecoli proteins, using their amino acid sequences, into their cell localization sites.
§ Page-Blocks dataset: A multi-class dataset used to classify all the blocks of the page layout of a document that have been detected by a segmentation process.
§ Wine Quality dataset (Red and White): Two datasets related to red and white vinho verde wine samples. The problem is to classify wine quality based on physicochemical tests.
§ Poker dataset: A multi-class dataset, with 10 classes, used to predict poker hands.
The overall performance of each classifier was tested using k-fold cross validation (with k = 5). So, each dataset is randomly partitioned into 5 disjoint subsets,
where each subset included approximately equal amounts of data; then, a single subset was retained as a test set, with the remaining k-1 subsets being used as the training data. Different datasets have different-sized minority classes (see Table 2 for details). For each of the SMOTE methods, we split each dataset into training and validation folds; then, on each fold, we oversample the minority class, train the classifier on the training folds and, finally, validate the classifier on the remaining fold. For CFA, the native counterfactuals in the training data were computed (the CF-Set) and then all the remaining unpaired majority-instances were run through CFA to create synthetic counterfactual instances in the minority class; this augmented dataset was then used for testing. These generated datasets from CFA were compared with the original dataset (without data augmentation) and the datasets generated by SMOTE, B-SMOTE, ADASYN, SVM-SMOTE, SL-SMOTE and SMOTE-RSB. For our experiments, we oversample the minority class using each of the data augmentation methods until we have the same number of instances in each class; as a result, fully balanced datasets were created.
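The per-fold protocol above (oversample only within the training folds, never the held-out fold) can be sketched like this, with random duplication of minority instances standing in for the actual oversamplers:

```python
# Sketch of the per-fold evaluation protocol: oversampling is applied to
# the training folds only, never the held-out fold. Random duplication of
# minority instances stands in for SMOTE/CFA here; data is synthetic.
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 80 + [1] * 20)          # imbalanced: 80 majority vs 20 minority

for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    minority = train_idx[y[train_idx] == 1]
    need = (y_tr == 0).sum() - (y_tr == 1).sum()   # instances needed for parity
    extra = rng.choice(minority, size=need, replace=True)
    X_bal = np.vstack([X_tr, X[extra]])            # balanced training data
    y_bal = np.concatenate([y_tr, y[extra]])
    assert (y_bal == 0).sum() == (y_bal == 1).sum()
    # ...train the classifier on (X_bal, y_bal), evaluate on X[test_idx]...
```

Keeping the held-out fold untouched is what prevents the synthetic instances from leaking into the evaluation.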

Figure 2 :
Figure 2: F1 values for the different conditions, across 25 datasets for the four classifiers (a) RF classifier, (b) k-NN, (c) LR, and (d) MLP

Figure 3 :
Figure 3: Selected examples for ROC curves where CFA outperformed SMOTE-based methods, obtained for the six methods using the four classifiers on different data sets.

Figure 4 :
Figure 4: Selected examples for ROC curves where SMOTE variants outperformed CFA, obtained for the six methods using the four classifiers on different data sets.
Temraz et al. [67] used the present instance-guided counterfactual method to generate synthetic data for a crop-growth prediction problem. Their problem domain involved a k-NN model -- called PBI-CBR -- for grass-growth prediction that relies on an historical dataset of specific measurements of climate and grass growth on dairy farms in Ireland (N=70,091 data-points covering 2013-2018). The PBI-CBR model does reasonably well at predicting grass growth for individual farms in the coming week using this historical data. But, with climate change, there are an increasing number of climate-disruptive events, events that diverge significantly from the scenarios recorded in the historical data (e.g., extreme values for key weather variables like solar radiation or soil moisture). For example, in 2018 there was a significant drought across Europe that effectively halted grass growth in Ireland during, what is usually, the peak-month of July (i.e., if soil moisture drops, then grass stops growing; indeed, high solar radiation will burn grass). Accordingly, the PBI-CBR model does not do very well at predicting grass growth for these climate-disrupted months of 2018, because they are historically unique. Temraz et al. defined a climate-based class boundary in the PBI-CBR dataset, creating a division between "normal cases" (with climate-variable values within 2 standard deviations of historical means) and "climate-disrupted cases" (with climate-variable values >2 standard deviations of historical means). From a classification perspective, these "normal cases" were the majority class and the "climate-disrupted cases" were the minority class. They then used the instance-based counterfactual method to create new synthetic climate-disrupted cases and showed that the PBI-CBR model's performance specifically improved on predicting climate-disruptive events (in 2018) using these newly-created minority data-points. Interestingly, Temraz et al.'s experiments showed that the instance-guided method did better than optimization-based methods in this problem domain (specifically, the DICE method [52]). However, this work only considers one specific problem domain, classifier, and dataset. It remains to be seen whether this counterfactual approach generalizes to other problem domains, classifiers and datasets; and, specifically, to datasets where class-imbalance problems arise.

Table 2 :
Datasets & DataSet Variants Used in the Experiment (IR= Imbalance Ratio)

Table 3 :
Confusion Matrix for Classifications. Receiver Operating Characteristic curves (ROC curves) were reported as they are often used to evaluate classification models for imbalanced datasets [26]. The ROC curve is a two-dimensional graph in which the true positive (TP) rate is plotted on the y-axis and the false positive (FP) rate is plotted on the x-axis. One advantage of ROC curves is that they are not affected by the class ratio between minority and majority instances in the datasets (see Figures for results). Area Under Curve (AUC) scores were also reported, as they measure the two-dimensional area that lies under the ROC curve.

Table 4 :
AUC values for the RF classifier for each Data Augmentation Method

Table 5 :
AUC values for the k-NN classifier for each Data Augmentation Method

Table 6 :
AUC values for the LR classifier for each Data Augmentation Method

Table 7 :
AUC values for the MLP classifier for each Data Augmentation Method. Notably, if we assess overall performance by noting the occasions on which a given method has the highest Precision score, we see that CFA scores best (see Table 8): in 70% (70/100) of cases it has the highest Precision score, as opposed to 24% (24/100) for SMOTE-RSB and 14% (14/100) for Baseline

Table 8 :
The number of datasets for each method showing the highest Precision and Recall scores for a given method (SMOTE, B-SMOTE, ADASYN, SVM-SMOTE, SL-SMOTE, SMOTE-RSB and CFA) on a selected classifier