Potential Anchoring for imbalanced data classification

Data imbalance remains one of the factors negatively affecting the performance of contemporary machine learning algorithms. One of the most common approaches to reducing the negative impact of data imbalance is preprocessing the original dataset with data-level strategies. In this paper we propose a unified framework for imbalanced data over- and undersampling. The proposed approach utilizes radial basis functions to preserve the original shape of the underlying class distributions during the resampling process. This is done by optimizing the positions of generated synthetic observations with respect to the potential resemblance loss. The final Potential Anchoring algorithm combines over- and undersampling within the proposed framework. The results of the experiments conducted on 60 imbalanced datasets show outperformance of Potential Anchoring over state-of-the-art resampling algorithms, including previously proposed methods that utilize radial basis functions to model class potential. Furthermore, the results of the analysis based on the proposed data complexity index show that Potential Anchoring is particularly well suited for handling naturally complex (i.e. not affected by the presence of noise) datasets.


Introduction
Data imbalance [1,2,3] occurs in the classification problem domain whenever the number of observations from one of the classes (majority class) is higher than the number of observations from one of the other classes (minority class). Existing learning algorithms are typically susceptible to the presence of data imbalance, displaying bias towards the majority class. The negative impact of data imbalance on the classifiers' performance is further exacerbated by inherent data difficulty factors such as class overlap, small disjuncts, presence of noise, and insufficient number of training observations [4,5,6,7].
The problem of data imbalance is ubiquitous in practical applications, affecting domains such as cancer malignancy grading [8], industrial systems monitoring [9], fraud detection [10], behavioral analysis [11] and cheminformatics [12]. Furthermore, data imbalance typically leads to the more costly type of error, for instance by inducing false negatives in the medical problem domain. Because of that, imbalanced data classification remains a focus of intense scientific effort.
One of the most prevalent approaches to dealing with data imbalance is the use of data-level algorithms: methods that reduce imbalance either by creating new minority class observations (oversampling) or by reducing the number of majority class observations (undersampling). In the case of oversampling in particular, this usually requires the generation of synthetic observations to prevent overfitting. Existing oversampling strategies typically modify the class distribution, focusing the generation of observations in specific regions based on the adopted strategy [13,14,15,16,17].
In this paper we propose a novel approach to imbalanced data resampling that is based on the idea of preserving the shape of the original class distributions. The proposed approach frames the resampling problem as an optimization of the positions of generated observations with respect to the potential resemblance loss, a tool for evaluating the relative shape of class distributions. The main contributions of this paper can be summarized as follows:
• Proposition of the potential resemblance loss, which utilizes radial basis functions to evaluate the relative distribution shape for two sets of observations.
• Integration of the potential resemblance loss into a unified over- and undersampling framework.
• Proposition of the data difficulty index, a function measuring the complexity of the considered dataset.
• Experimental comparison of the proposed approach with state-of-the-art resampling strategies.
• Examination of factors influencing the relative performance of the proposed approach.
The rest of this paper is organized as follows. In Section 2 we discuss the relevant scientific contributions. In Section 3 we introduce the concept of potential resemblance loss, and discuss how it can be utilized during imbalanced data resampling. In Section 4 we introduce the concept of data difficulty index, a measure of dataset complexity that will be later utilized to identify the areas of applicability of the proposed algorithm. In Section 5 we describe the conducted experimental study and the observed results. Finally, in Section 6 we present our conclusions.

Related work
Two main approaches to dealing with data imbalance can be distinguished. The first are data-level methods, which aim to modify the training data to artificially reduce the degree of imbalance, either by creating new minority class observations (oversampling) or by removing majority class observations (undersampling).
By far the most prevalent oversampling paradigm are neighborhood-based algorithms originating from the Synthetic Minority Over-sampling Technique (SMOTE) [18]. SMOTE is considered a cornerstone of contemporary imbalanced data resampling [19], having inspired numerous other approaches based on the idea of interpolating nearest minority class observations during oversampling. These methods typically focus the oversampling in specific regions, modifying the original class distribution in the process. Notable examples include Borderline-SMOTE (Bord) [13], Safe-Level-SMOTE [14], Adaptive Synthetic Sampling (ADASYN) [15], and MWMOTE [16].
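The interpolation mechanism shared by these methods can be illustrated with a short sketch. The function `smote_sketch` below is our own simplified illustration, not the reference implementation; it assumes a minority class matrix `X_min` and generates each synthetic point on the line segment between a random minority observation and one of its minority nearest neighbors:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_synthetic, k=5, random_state=0):
    """Illustrative SMOTE-style interpolation between minority neighbors."""
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # indices of the k nearest minority neighbors (column 0 is the point itself)
    _, idx = nn.kneighbors(X_min)
    idx = idx[:, 1:]
    samples = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))   # random minority observation
        j = rng.choice(idx[i])         # one of its minority neighbors
        step = rng.random()            # interpolation coefficient in [0, 1]
        samples.append(X_min[i] + step * (X_min[j] - X_min[i]))
    return np.array(samples)
```

Because each synthetic observation is a convex combination of two existing minority points, the generated samples always lie within the convex hull of the minority class, which is precisely the behavior that can alter the produced class density.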
However, despite their prevalence, SMOTE-based techniques have several shortcomings that limit their usefulness on datasets affected by various data difficulty factors. The most significant drawbacks of SMOTE are that it assumes homogeneous minority class clusters, and that it does not consider the majority objects during the resampling process. Numerous attempts have been made to address the aforementioned issues, leading to the development of novel oversampling approaches [20,21]. It was also demonstrated that SMOTE performs poorly on high-dimensional data [22,23].
One category of undersampling algorithms are cleaning strategies, the aim of which is the removal of a subset of the existing majority class observations. This is typically done based on a heuristic strategy of finding the observations inconsistent with the remainder of the data. Early examples of cleaning strategies include methods such as Tomek Links [24] and the Edited Nearest-Neighbor rule [25]. These methods are often used in combination with SMOTE oversampling. Some more recent cleaning strategies can also be distinguished. For instance, Anand et al. [26] proposed sorting the undersampled observations based on the weighted Euclidean distance from the positive samples. Smith et al. [27] advocated using the instance hardness criterion to determine the order of observation removal.
Clustering-based algorithms constitute another popular family of undersampling algorithms. These methods employ clustering either to replace the original data with a completely novel set of observations [28], or to determine the most representative subset of the original observations [29]. Clustering-based undersampling was also successfully used to form classifier ensembles [30], an idea that was further extended in the form of evolutionary undersampling [31] and boosting [32].
It is also worth mentioning that some data-level approaches utilizing radial basis functions can be distinguished in the literature. The SWIM framework [33] used them to model the density of well-sampled majority class observations in the case of extreme imbalance, with the goal of constraining the oversampling regions. Radial-Based Oversampling (RBO) [17], on the other hand, used radial basis functions to guide oversampling towards regions of low absolute potential, which can be interpreted as lying close to the decision border. This concept was further extended to the undersampling setting with Radial-Based Undersampling (RBU) [34], where it determined the order of observation removal.
The differences between over- and undersampling algorithms have been examined in the literature, from both the theoretical and the experimental standpoint. It was recognized early that over- and undersampling, despite both being data-level approaches, face unique challenges. In particular, while undersampling strategies are mostly concerned with limiting the information loss caused by the removal of observations, the main concern for oversampling algorithms is reducing the risk of overfitting [35].
Furthermore, some studies concerned with the question of which of the algorithms yields better performance can be distinguished. Notably, some experimental results seem to indicate that undersampling can be a preferable approach for data affected by high levels of noise [36]. Another study concludes that an important factor affecting the relative performance is the level of imbalance, and that oversampling tends to perform better on severely imbalanced datasets [37]. Finally, a recent study demonstrated that combining over- and undersampling with a properly chosen ratio can be beneficial to the final performance [38]. However, there does not seem to be a clear consensus on which family of algorithms outperforms the other, and in practice an experimental evaluation is usually required to determine the optimal approach for a given dataset.
Finally, the second category of methods for handling data imbalance consists of algorithm-level strategies. This type of method modifies the training procedure of traditional classification algorithms to better account for the data imbalance, and to reduce its negative impact on the minority class performance. Notable examples of algorithm-level solutions include: kernel functions [39], splitting criteria in decision trees [40], and modifications of the underlying loss function to make it cost-sensitive [41].
However, contrary to data-level approaches, algorithm-level strategies require a specific choice of classification algorithm, making them less flexible. Still, in many cases, they are reported to lead to better performance [42].

Potential Anchoring
In this paper we propose a novel, unified framework for imbalanced data over- and undersampling. The proposed framework utilizes radial basis functions to measure the resemblance of class distributions between the original and the synthesized observations. In the remainder of this section we outline the motivation behind the proposed approach, introduce the concept of potential resemblance loss, and describe how it can be integrated into both the over- and undersampling procedures.

Motivation
Early attempts at dealing with data imbalance utilized random oversampling, a process during which identical copies of the existing observations are created. However, it was soon realized that creating exact duplicates of the existing observations can lead to overfitting of certain classifiers, which motivated the proposal of SMOTE [18]. Contrary to the exact duplication of the existing observations, SMOTE creates synthetic observations by interpolating the existing ones. While this process was empirically shown to outperform random oversampling, it can alter the produced minority class distribution. This is due to the fact that the density of the original observations is not guaranteed to be identical to that of the generated ones.
Since its inception, SMOTE has become a cornerstone of contemporary imbalanced data classification [19], motivating numerous extensions to the original approach. A common theme, shared by a majority of these SMOTE-derived methods, is a mechanism of focusing the oversampling in certain regions. Examples of such approaches include Borderline-SMOTE [13] which, as the name indicates, focuses resampling near the borderline region; Safe-Level-SMOTE [14], in which observations are generated near the computed safe regions; ADASYN [15], which produces a higher quantity of synthetic observations around difficult observations; and MWMOTE [16], which also focuses on the difficult observations. More recently, density-based approaches utilizing radial basis functions, such as Radial-Based Oversampling [17] and Sampling With the Majority Class [33], also display the behavior of modifying the underlying class distribution. All of the aforementioned resampling strategies are based on different, often contradictory, ideas on where the resampling process should be focused. While all of them have their own niches of outperformance, out of necessity they are specialized, and it is often not clear which method, if any, is preferred in the general case.
Instead of using an ad hoc strategy of boosting specific regions of the data space, in this paper we propose preserving the original shape of the underlying class distribution. Specifically, we achieve that by treating the generated synthetic observations as optimization parameters, which are positioned to minimize the difference between the potential of the original and the resampled observations, with a regularization constraint added to prevent overfitting during oversampling.

Potential resemblance loss
We base our approach on the concept of class potential. Potential functions were previously used in the context of imbalanced data resampling in [17]. Given a collection of observations X and a spread of a single radial basis function γ, the potential in a given point in space x can be defined as

$$\Phi(x, X, \gamma) = \sum_{i=1}^{|X|} \exp\left(-\left(\frac{\lVert X_i - x \rVert_2}{\gamma}\right)^2\right).$$

Intuitively, potential can be viewed as a measure of cumulative proximity of x to X, with a higher value of potential indicating that more observations from X lie in close proximity to x. Of particular interest to our discussion are the majority and minority class potentials, computed on, respectively, the collection of majority class observations X_maj and the collection of minority class observations X_min. Class potential can be viewed as a measure of the density of observations from that class.
The values of the potential function are not, however, bound to any specific range, making it difficult to compare the relative shape of the potential computed with respect to two different collections of observations. To mitigate this issue we propose a normalized potential function Ψ. This function computes the potential for k anchor points A, and returns a vector of k normalized potentials. More formally, we define the ith component of the normalized potential function as

$$\Psi(A, X, \gamma)_i = \frac{\Phi(A_i, X, \gamma)}{\sum_{j=1}^{k} \Phi(A_j, X, \gamma)}.$$

Due to the non-negativity of Φ, the values of Ψ are also non-negative and range from 0 to 1. The normalized potential function describes the relative density of observations in any given anchor point A_i. This property makes it possible to directly compare the outputs of Ψ computed with respect to two different collections of observations, even if the collections differ in size, as will be the case during resampling.
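As an illustration, the potential and its anchored normalization can be sketched in a few lines of NumPy. The function names `potential` and `normalized_potential` are our own, and the sketch assumes the Gaussian radial basis function given above:

```python
import numpy as np

def potential(x, X, gamma):
    """Class potential: sum of Gaussian RBF contributions of all points in X at x."""
    dists = np.linalg.norm(X - x, axis=1)
    return np.sum(np.exp(-(dists / gamma) ** 2))

def normalized_potential(A, X, gamma):
    """Vector of potentials at the k anchor points A, normalized to sum to 1."""
    phi = np.array([potential(a, X, gamma) for a in A])
    return phi / phi.sum()
```

Note that the normalization makes the output independent of |X|, which is what allows the later comparison between an original collection and a differently sized collection of prototypes.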
Finally, based on the concept of normalized potential we define the potential resemblance loss. Given a collection of original observations X, k anchor points A, a collection of prototypes P (generated observations whose positions we wish to optimize), their starting positions P_0, a radial basis function spread γ, and a regularization coefficient λ, we define the potential resemblance loss as

$$L(X, A, P, P_0, \gamma, \lambda) = \frac{1}{k} \sum_{i=1}^{k} \left( \Psi(A, X, \gamma)_i - \Psi(A, P, \gamma)_i \right)^2 + \frac{\lambda}{|P|} \sum_{j=1}^{|P|} \exp\left(-\left(\frac{\lVert P_j - P_{0,j} \rVert_2}{\gamma}\right)^2\right).$$

The left-side term of the equation is a mean squared error between the normalized potentials computed with respect to X and P, and as such measures the difference in the relative shape of the potential produced by these two collections of observations. The right-side term is a regularization term that penalizes the proximity of the prototypes P to their starting positions P_0, thereby encouraging their displacement. If P_0 is created by randomly sampling X, as will be the case in the proposed approach, this term prevents the algorithm minimizing L from degenerating to random oversampling, since the prototypes P will be displaced from their original positions. It is worth noting that even though in principle we would like to penalize the placement of P close to any of the observations from X, in our experiments considering only the starting positions P_0 was sufficient to prevent overfitting, while at the same time being more computationally efficient.
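A minimal sketch of the loss follows, under the assumption that the regularization takes the Gaussian-proximity form written above (the exact functional form used in the original method may differ); all function names are ours:

```python
import numpy as np

def normalized_potential(A, X, gamma):
    """Normalized potential of X evaluated at the anchor points A."""
    d = np.linalg.norm(A[:, None, :] - X[None, :, :], axis=2)
    phi = np.exp(-(d / gamma) ** 2).sum(axis=1)
    return phi / phi.sum()

def resemblance_loss(X, A, P, P0, gamma, lam):
    """Potential resemblance loss: MSE between the normalized potentials of the
    original observations X and the prototypes P at the anchors A, plus a
    regularization term penalizing prototypes that remain close to their
    starting positions P0 (Gaussian-proximity form; an assumption)."""
    mse = np.mean((normalized_potential(A, X, gamma)
                   - normalized_potential(A, P, gamma)) ** 2)
    prox = np.exp(-(np.linalg.norm(P - P0, axis=1) / gamma) ** 2)
    return mse + lam * prox.mean()
```

With λ = 0 and P identical to X the loss vanishes, which is exactly the degenerate random-oversampling solution the regularization term is designed to rule out.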

Algorithm
Equipped with the potential resemblance loss L, we can formulate the problem of imbalanced data resampling as the optimization of the prototype positions P with respect to L. While the proposed approach is motivated from the point of view of oversampling, it is also easily applicable to undersampling, as the same principle of preserving the original class density can be applied in both cases. The main difference between the two is that while during oversampling we preserve the original minority observations and use the prototypes P as a collection of additional, synthetic observations, during undersampling we instead replace the original majority observations with a smaller collection of prototypes preserving the original class potential.
We present the pseudocode of the proposed Potential Anchoring (PA) algorithm in Algorithm 1. The method combines over- and undersampling up to the point of achieving a balanced class distribution, with the ratio of imbalance eliminated with either over- or undersampling treated as a parameter. First, k anchor points, with respect to which the normalized potential will be calculated, are generated via clustering of the collection of original observations X. Second, the prototypes are initialized by randomly sampling with replacement from the collection of observations of a given class; importantly, a small random jitter is afterwards introduced to break the symmetry during the optimization. Finally, the prototypes are optimized with respect to the potential resemblance loss function L, separately for the majority and the minority class. The loss is penalized with the regularization coefficient λ only in the case of oversampling. Throughout the conducted experiments we used k-means clustering to generate the anchor points. We also leveraged the differentiability of L and conducted the optimization using the Adam optimizer [43].
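The overall procedure can be sketched as follows. This is an illustrative simplification, not the reference implementation: it substitutes L-BFGS-B with numerical gradients for the Adam optimizer used in the paper, and the default hyperparameter values are arbitrary assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.cluster import KMeans

def potential_anchoring(X_maj, X_min, ratio=0.1, k=10, gamma=0.25,
                        lam=0.01, jitter=1e-3, random_state=0):
    """Illustrative sketch of PA: anchor points via k-means, random prototype
    initialization with jitter, then minimization of the resemblance loss."""
    rng = np.random.default_rng(random_state)
    X = np.vstack([X_maj, X_min])
    A = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X).cluster_centers_

    def psi(Z):
        # normalized potential of the observations Z at the anchor points A
        d = np.linalg.norm(A[:, None, :] - Z[None, :, :], axis=2)
        phi = np.exp(-(d / gamma) ** 2).sum(axis=1)
        return phi / phi.sum()

    def optimize(X_cls, n, lam_cls):
        P0 = X_cls[rng.integers(len(X_cls), size=n)]  # prototypes sampled with replacement
        target = psi(X_cls)

        def loss(flat):
            P = flat.reshape(P0.shape)
            mse = np.mean((target - psi(P)) ** 2)
            prox = np.exp(-(np.linalg.norm(P - P0, axis=1) / gamma) ** 2)
            return mse + lam_cls * prox.mean()

        start = (P0 + rng.normal(scale=jitter, size=P0.shape)).ravel()
        res = minimize(loss, start, method="L-BFGS-B", options={"maxiter": 50})
        return res.x.reshape(P0.shape)

    n_pao = round(ratio * (len(X_maj) - len(X_min)))  # synthetic minority count
    n_pau = len(X_min) + n_pao                        # retained majority count
    P_min = optimize(X_min, n_pao, lam)   # oversampling: regularized
    P_maj = optimize(X_maj, n_pau, 0.0)   # undersampling: no regularization
    return P_maj, np.vstack([X_min, P_min])
```

After resampling, both returned collections have the same size, reflecting the balanced class distribution targeted by the method.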
A particular case of PA involves eliminating the imbalance solely by either oversampling (PAO) or undersampling (PAU). We illustrate the concept of class potential in both cases in Figure 1. As can be seen, PA generates synthetic observations whose potential resembles the original, despite the fact that it is anchored in a small number of points. Secondly, we illustrate the impact of the regularization coefficient λ on the behavior of PAO in Figure 2. As can be seen, higher values of λ lead to a lower similarity between the original and the generated potential shape, and a higher spread of the synthesized observations. Disabling regularization leads to minimal translation of the prototypes, and behavior closely resembling random oversampling. Finally, we illustrate the behavior of PA, with both over- and undersampling used and regularization enabled, in Figure 3.

Data difficulty index
An important aspect of experimental studies involving imbalanced data resampling algorithms, or machine learning algorithms in general, is identification of their areas of applicability. According to the "no free lunch" theorem [44], we should not expect any single algorithm to achieve an optimal performance in every considered problem. Instead, we can identify the conditions under which the considered algorithm tends to outperform the reference methods. This serves two goals. First, to provide a rule of thumb for a practitioner deciding whether the usage of our algorithm is sensible. Second, to guide the future research, by focusing on the detected strengths and weaknesses of the proposed method during further development.

Algorithm 1 Potential Anchoring
Input: collection of original observations X divided into majority observations X_maj and minority observations X_min
Parameters: ratio of imbalance eliminated with oversampling, number of anchor points k, number of iterations, radial basis function spread γ, oversampling regularization coefficient λ, learning rate α, random jitter ε used for initialization
Output: collection of resampled observations X′
1: function PA(X_maj, X_min, ratio, k, iterations, γ, λ, α, ε):
2:   A ← k anchor points obtained by clustering X
3:   n_PAO ← ratio · (|X_maj| − |X_min|)
4:   n_PAU ← |X_min| + n_PAO
5:   P⁰_min ← n_PAO prototypes randomly selected with replacement from X_min
6:   P⁰_maj ← n_PAU prototypes randomly selected with replacement from X_maj
7:   P_min, P_maj ← P⁰_min, P⁰_maj with added random jitter ε
8:   for i in 1..iterations do
9:     perform optimization step on P_min w.r.t. L(X_min, A, P_min, P⁰_min, γ, λ) using learning rate α
10:  end for
11:  for i in 1..iterations do
12:    perform optimization step on P_maj w.r.t. L(X_maj, A, P_maj, P⁰_maj, γ, 0) using learning rate α
13:  end for
14:  X′ ← X_min ∪ P_min ∪ P_maj
15:  return X′
One of the most important factors affecting the performance during imbalanced data classification is the complexity of the considered datasets. Recent studies [7,42] recognize that data imbalance, by itself, does not have to pose a challenge for learning algorithms. Instead, it exacerbates the negative impact of other data difficulty factors, such as small sample size, presence of disjoint and overlapping data distributions, and presence of outliers and noisy observations. It is therefore beneficial to evaluate the impact of said factors on the performance of the proposed algorithms.
A recent methodology proposed by Koziarski [34] examined the relation between the proportion of difficult observations and the relative performance observed for a given algorithm. This methodology utilized the categorization introduced by Napierała and Stefanowski [45], in which minority observations are assigned one of four categories based on their nearest neighborhood. Specifically, the category is assigned based on the number of nearest neighbors from the same class: safe in the case of 4 to 5 neighbors from the same class, borderline in the case of 2 to 3 neighbors, rare in the case of 1 neighbor, and outlier when there are no neighbors from the same class. The proportion of observations from any given category was then correlated across multiple datasets with the rank achieved by the considered algorithm on a given dataset. Afterwards, conclusions could be drawn about either an increase or a decrease in the relative performance, depending on the proportion of observations from a given category. This analysis was conducted separately for each observation category. One shortcoming of this approach was caused by the fact that not every dataset consists of observations from each category, which could have a confounding effect on the results of the analysis. To give an example, we can consider two datasets, the first consisting entirely of rare minority observations, and the second entirely of outliers. While both datasets can clearly be classified as difficult, the first one will not be treated as such during the examination of the outlier percentage (since it contains no outliers), nor the second during the examination of the percentage of rare observations. This independence of the analyses conducted with respect to different observation categories can have an unwanted effect on the results.
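The categorization can be sketched as follows; this is an illustrative implementation of the described scheme, assuming binary labels with the minority class encoded as 1:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def napierala_category(X, y, minority_class=1):
    """Assign each minority observation a difficulty category based on the
    number of same-class observations among its 5 nearest neighbors."""
    labels = {5: "safe", 4: "safe", 3: "borderline", 2: "borderline",
              1: "rare", 0: "outlier"}
    nn = NearestNeighbors(n_neighbors=6).fit(X)  # 5 neighbors plus the point itself
    _, idx = nn.kneighbors(X[y == minority_class])
    same = (y[idx[:, 1:]] == minority_class).sum(axis=1)
    return [labels[s] for s in same]
```

For a well-separated minority cluster every minority observation falls into the safe category, which is precisely the situation where the per-category analysis described above carries no information about the other categories.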
To address this issue we propose extending the aforementioned categorization into a data difficulty index (DI), a function that measures the average proportion of majority neighbors for each minority observation. More formally, let us denote the collection of all observations by X, the collection of n_min minority observations by X_min, and the ith minority observation by X_min^(i). We can then define the difficulty index of a collection of observations X, parametrized by m, the number of considered nearest neighbors of each minority observation, as

$$DI(X) = \frac{1}{n_{min}} \sum_{i=1}^{n_{min}} \frac{1}{m} \sum_{j=1}^{m} \mathbb{1}\left[ NN(X_{min}^{(i)}, X, j) \in X_{maj} \right], \quad (5)$$

where NN(x, X, j) denotes the jth nearest neighbor of x in X, and 𝟙 is the indicator function. We present the impact of dataset characteristics on the calculated difficulty index in Figure 4. Similar to the original categorization, we used the 5-neighborhood for its calculation. As can be seen, very low values of DI are obtained even for highly imbalanced datasets, as long as the class distributions can be clearly separated. However, as the entropy of the data increases, so do the obtained values of DI.
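Equation (5) translates directly into a few lines of code; the sketch below is our own illustration, again assuming binary labels with the minority class encoded as 1:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def difficulty_index(X, y, minority_class=1, m=5):
    """DI: average fraction of majority observations among the m nearest
    neighbors of each minority observation."""
    nn = NearestNeighbors(n_neighbors=m + 1).fit(X)
    _, idx = nn.kneighbors(X[y == minority_class])
    # drop column 0 (the point itself), then count majority neighbors
    majority_frac = (y[idx[:, 1:]] != minority_class).mean(axis=1)
    return majority_frac.mean()
```

A perfectly separated dataset yields DI = 0 regardless of the imbalance ratio, while heavily overlapping or noisy data pushes DI towards 1.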

Experimental study
To empirically evaluate the usefulness of the proposed PA algorithm we conducted a series of experiments. For each dataset we computed the data difficulty index (DI) using m = 5 nearest neighbors. Prior to resampling and classification, each dataset was preprocessed: categorical features were encoded as integers, and afterwards all features were standardized by removing the mean and scaling to unit variance.
Classification. Four different classification algorithms, representing different learning paradigms, were used throughout the experimental study: CART decision tree, k-nearest neighbors classifier (KNN), support vector machine (SVM) and multi-layer perceptron (MLP). The implementations of the classification algorithms provided in the scikit-learn machine learning library [47] were utilized. The hyperparameters used for the classification algorithms are presented in Table 2.
Throughout the experimental study all resamplers were used up to the point of achieving a balanced class distribution.
Evaluation. For every dataset we reported the results averaged over the 5 × 2 cross-validation folds [57]. Throughout the experimental study we reported the values of precision, recall, AUC and G-mean. We intentionally excluded F-measure from the considered performance metrics, since it was previously shown [58] that F-measure is usually more biased towards the majority class than AUC and G-mean, making the latter two more suitable for the assessment of performance in the imbalanced data classification task.
Implementation and reproducibility. The experiments described in this paper were implemented in the Python programming language. Complete code, sufficient to repeat the experiments, was made publicly available¹. In addition to the code we also provide the cross-validation folds used during the experiments, as well as the raw result files, which can be used for further analysis.
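The reported metrics can be computed with scikit-learn; the sketch below assumes the common definition of G-mean as the geometric mean of sensitivity and specificity, with the minority class encoded as 1:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    """Compute the four metrics used in the study."""
    sensitivity = recall_score(y_true, y_pred, pos_label=1)   # minority recall
    specificity = recall_score(y_true, y_pred, pos_label=0)   # majority recall
    return {
        "precision": precision_score(y_true, y_pred, pos_label=1),
        "recall": sensitivity,
        "auc": roc_auc_score(y_true, y_score),
        "g-mean": np.sqrt(sensitivity * specificity),
    }
```

Unlike accuracy, both AUC and G-mean degrade whenever either class is neglected, which is why they are preferred for imbalanced evaluation.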

Comparison with previous radial-based strategies
As previously discussed, similar to the RBO and RBU algorithms, PA relies on the concept of class potential to estimate the local density of observations from a given class. Even though the two families of algorithms use class potential differently during the resampling process, a natural question is whether the approach proposed in this paper, by itself, improves the performance during both over- and undersampling. To answer this question we began our analysis with a pairwise comparison of PAO and RBO, as well as PAU and RBU. To assess the statistical significance of this comparison we conducted a two-sided Wilcoxon signed-rank test, the results of which are presented in Table 3 for the pair of oversampling algorithms, and in Table 4 for the pair of undersampling algorithms. Furthermore, boxplots with respect to G-mean achieved by all four methods are presented in Figure 5.
As can be seen, with the sole exception of the AUC of the undersampling algorithms combined with the CART classifier, the results of this pairwise comparison were statistically significant. Furthermore, with the only exception of the comparison of undersampling algorithms combined with the CART classifier (for both AUC and G-mean), resampling strategies based on the potential anchoring paradigm outperformed the radial-based strategies, as demonstrated by a better average performance and a higher number of datasets on which potential anchoring achieved better results. It is also worth noting that the improvement in performance was stronger in the case of oversampling than in the case of undersampling, as evidenced by lower p-values and a higher number of datasets on which PAO outperformed RBO. In general, the observed results lead to the conclusion that the potential anchoring paradigm, on average, produces an improvement in performance compared to the radial-based strategies, both for over- and undersampling.

Combining over- and undersampling
The second extension of the previous radial-based approaches proposed in this paper is combining over- and undersampling with the goal of improving the performance of the individual algorithms. To evaluate the hypothesis that combined over- and undersampling leads to an improved performance we conducted an experiment in which we modulated the ratio of imbalance alleviated by oversampling. Specifically, we considered values of the ratio parameter in {0.0, 0.1, 0.2, ..., 1.0}, with higher values corresponding to stronger oversampling and weaker undersampling, both applied together up to the point of achieving a balanced class distribution. In particular, a ratio equal to 1 indicated that only oversampling was used (or, in other words, PA degenerated to PAO), and a ratio equal to 0 indicated that only undersampling was used (PA degenerated to PAU).
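The semantics of the ratio parameter can be made concrete with a small worked example; the helper `resampling_counts` is our own illustration, assuming the imbalance is fully eliminated:

```python
def resampling_counts(n_maj, n_min, ratio):
    """Number of synthetic minority observations (oversampling) and retained
    majority observations (undersampling) for a given ratio parameter."""
    n_oversampled = round(ratio * (n_maj - n_min))
    n_majority_kept = n_min + n_oversampled
    return n_oversampled, n_majority_kept
```

For instance, with 900 majority and 100 minority observations and a ratio of 0.1, 80 synthetic minority observations are generated and the majority class is undersampled to 180, so both classes end up with 180 observations.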
The results, averaged over all of the datasets, are presented in Figure 6. First of all, as can be seen, the ratio impacted both precision and recall monotonically, with precision increasing and recall decreasing as the ratio grows. In other words, stronger oversampling produced better precision of the predictions at the cost of their recall, and stronger undersampling had the opposite effect. This trend was consistent for all of the considered classification algorithms. Secondly, and perhaps more importantly, the ratio also affected the performance measured with respect to the combined metrics, that is AUC and G-mean. However, contrary to the results observed for precision and recall, this trend was not monotonic across all of the considered ratio values, but instead displayed a peak at a ratio equal to 0.1, shared by all of the considered classifiers and both combined metrics. This was the case regardless of the exact shape of the performance curve in any specific case, and in particular regardless of which of PAO and PAU displayed a better average performance for a considered classifier and metric combination. This leads to the conclusion that the strategy of eliminating the imbalance by a combination of over- and undersampling is beneficial to the performance of techniques based on the potential anchoring paradigm when strong undersampling is combined with weak oversampling, likely because this combination achieves a better precision-recall trade-off with respect to the combined metrics.

Comparison with reference resampling strategies
In the next stage of the experimental study we compared the performance of PA to that of the reference methods. To assess the statistical significance of this comparison we employed the Friedman test combined with Shaffer's post-hoc procedure. We reported the results at the significance level α = 0.10. We present a summary of this statistical comparison, containing the average ranks achieved by each method as well as an indication of the cases in which statistically significant differences were observed, in Table 5. Furthermore, we also present boxplots illustrating the performance of each method with respect to G-mean in Figure 7.
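The rank-based part of this methodology can be sketched as follows; Shaffer's post-hoc procedure is omitted (it is not available in SciPy), so the sketch covers only the average ranks and the Friedman test itself:

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_with_ranks(scores):
    """scores: (n_datasets, n_methods) matrix of per-dataset metric values.
    Returns average ranks (1 = best) and the Friedman test p-value."""
    ranks = rankdata(-scores, axis=1)      # higher score -> lower (better) rank
    stat, p = friedmanchisquare(*scores.T)
    return ranks.mean(axis=0), p
```

A small p-value only indicates that some difference between the methods exists; identifying which pairwise differences are significant requires a post-hoc procedure such as Shaffer's.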
Several observations can be made based on the presented results. First of all, compared to the reference methods PA achieved significantly better recall at the cost of significantly worse precision, indicating stronger bias towards the minority class than the reference methods. When combined performance metrics were considered, PA achieved particularly strong performance when combined with SVM and MLP classifiers, in both cases achieving the highest average ranks for both AUC and G-mean. This outperformance was statistically significant: in the case of SVM, PA achieved significantly better results with respect to at least one of the combined metrics in 8 out of 9 cases (with the exception of LVQ-SMOTE), and in the case of MLP in 7 out of 9 cases (with the exception of LVQ-SMOTE and RBO). Statistically significant outperformance was also observed, to a lesser extent, in the case of CART and KNN classifiers, when compared to 5 reference methods for CART and 3 reference methods for KNN. Importantly, only in a single case of AUC measured for CART classifier did PA achieve a significantly worse performance than the reference method. Finally, as can be seen on the presented boxplots, PA tended to achieve visibly higher minimum performance than the reference methods, indicating its usefulness in a general case.

Examination of factors influencing the performance of PA
Finally, in the last stage of the conducted experiments we examined what the outperformance of PA can be attributed to. To this end we conducted two experiments. In the first of them we examined the relation between the dataset characteristics and the relative performance observed for PA. Specifically, we used the data difficulty index (DI) to measure the complexity of a given dataset, and correlated it with the rank achieved by PA on that dataset. We present the Pearson correlation coefficients of these two variables in Table 6, and scatterplots illustrating this relationship in Figure 8. As can be seen, there is a strong, statistically significant correlation between DI and the relative performance of PA. This trend is consistent across the classification algorithms and applies to both AUC and G-mean. Interestingly, in contrast to the results presented in Section 5.4, for two of the considered classifiers a statistically significant improvement in both relative precision and recall was observed for datasets with higher DI. This means that for the more complex datasets not only did recall remain higher, but precision also started to improve (relative to the reference methods). Overall, the observed results indicate that PA is particularly well suited for application to complex datasets. This is further illustrated in Figure 9, where we once again present boxplots illustrating the performance of the individual methods, this time on a subset of 31 datasets for which DI was equal to or higher than 0.6 (the median DI value observed across all of the considered datasets). As can be seen, the differences between PA and the reference methods are more clearly pronounced compared to those presented previously for all of the datasets in Figure 7.
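The correlation analysis above amounts to a per-dataset Pearson test between DI and the rank achieved by PA. A minimal sketch, with synthetic stand-ins for the DI values and PA ranks (both hypothetical, not the experimental measurements):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
di = rng.uniform(0.2, 1.0, size=60)                  # hypothetical difficulty index per dataset
pa_rank = 5.0 - 3.0 * di + rng.normal(0.0, 0.5, 60)  # lower (better) rank on harder datasets

# Correlate dataset difficulty with the rank PA achieved on that dataset.
r, p = pearsonr(di, pa_rank)
```

A negative coefficient here would indicate that PA's rank improves (decreases) as dataset difficulty grows, which is the pattern reported in Table 6.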
While we were able to show that there is a relation between the complexity of a dataset and the relative performance of PA compared to the reference methods, it is not clear to what type of adversity PA is actually resilient. Broadly speaking, we can consider two causes of data complexity. The first is natural, where a complicated decision boundary is required to properly discriminate the data, but the existing observations faithfully describe the underlying class distribution. The second is artificial, where complexity is caused by a high level of noise, occurring due to factors such as measurement errors, labeling errors, etc., and the observed data does not properly represent the real class distribution. It is not clear which of these two factors is the source of data complexity in the considered benchmark datasets.
To evaluate the impact of the second of the aforementioned sources of complexity we conducted an experiment in which we artificially introduced noise to the original datasets. Specifically, we considered the case of label noise, in which we randomly switched the label of an original majority class observation to the minority class with a probability equal to the set noise level. During this experiment we varied the noise level over {0.0, 0.04, 0.08, ..., 0.2} and recorded the average performance of the different resampling methods. We present the observed results in Figure 10. As can be seen, contrary to what could be assumed based on the previously observed relation between data complexity and relative performance, PA does not display resilience to noise. In fact, the opposite is true: PA is highly susceptible to the presence of noise, with average performance dropping at a rate significantly higher than that of the reference methods. This trend was particularly noticeable in the case of the CART and KNN classifiers, which is a possible explanation for the relatively better performance displayed by PA in combination with SVM and MLP, as described in Section 5.4. Overall, the observed results indicate that while PA seems to outperform other methods on naturally complex datasets, it is at the same time very prone to the presence of noise. This suggests a rule of thumb for practical use: PA is not recommended for datasets with a known presence of noise (i.e. datasets whose labeling is highly subjective or prone to error). Furthermore, it suggests that combining PA with other types of data preprocessing, in particular noise removal, might be a feasible direction for further research.
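The label-noise injection used in this experiment can be sketched as follows; the function name, default labels, and seed are illustrative, not the authors' exact implementation:

```python
import numpy as np

def inject_label_noise(y, noise_level, majority_label=0, minority_label=1, seed=42):
    """Flip each majority-class label to the minority class with probability noise_level."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    flip = rng.random(y.shape[0]) < noise_level
    y[(y == majority_label) & flip] = minority_label
    return y

y = np.array([0] * 90 + [1] * 10)               # imbalanced labels: 90 majority, 10 minority
noisy = inject_label_noise(y, noise_level=0.2)  # ~20% of majority labels flipped
```

The resampling methods under comparison would then be trained on the noisy labels, with the noise level swept over {0.0, 0.04, ..., 0.2} and the test performance averaged at each level.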

Lessons learned
Based on the presented experimental results we can now attempt to answer the research questions asked at the beginning of this section. RQ1: How do PAO and PAU compare with the previously proposed radial-based resampling strategies? The Potential Anchoring approach significantly outperformed the previously proposed radial-based strategies, both in the form of over- and undersampling, with stronger differences observed in the case of oversampling. This indicates the usefulness of the proposed approach of preserving the potential shape, and shows that the demonstrated outperformance of PA is not only due to combining over- and undersampling, but also due to the individual usefulness of both PAO and PAU. RQ2: Is it possible to improve the individual performance of PAO and PAU by combining over- and undersampling?
In the conducted experiments we achieved the best performance by combining both over- and undersampling, with the imbalance eliminated in a small proportion by oversampling and in a large proportion by undersampling. This trend was consistent across the classification algorithms and the performance metrics, regardless of whether over- or undersampling achieved better stand-alone performance in a particular case. It is worth mentioning that this trend was consistent with the results presented in previous studies [38], indicating that it might be generally applicable to resampling algorithms. RQ3: How does PA compare with state-of-the-art resampling strategies? PA outperformed the considered resampling algorithms, in particular when combined with either the SVM or MLP classifier, for which the proposed method achieved the highest average ranks and statistical significance in comparison with a majority of the considered methods. This was achieved by obtaining a significantly better recall of the predictions at the expense of their precision.
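The division of labor between over- and undersampling described above can be illustrated with a small helper that computes target class sizes. The fraction parameter and function name are hypothetical; in practice the split would be tuned per dataset:

```python
def resampling_targets(n_majority, n_minority, over_fraction=0.2):
    """Close the class-size gap: a small fraction via oversampling the minority,
    the remainder via undersampling the majority, until both classes are equal."""
    gap = n_majority - n_minority
    n_minority_target = n_minority + round(over_fraction * gap)
    n_majority_target = n_minority_target  # undersample down to the new minority size
    return n_majority_target, n_minority_target
```

For example, with 100 majority and 20 minority observations and `over_fraction=0.25`, a quarter of the 80-observation gap is closed by generating 20 synthetic minority observations, and the majority class is then undersampled to the resulting size of 40.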
RQ4: Under what conditions does PA outperform other resampling algorithms? PA achieved the best performance, relative to the reference methods, on difficult datasets, indicating that the proposed approach of preserving the potential shape is particularly well suited for handling complex data. However, our experiments also indicate that PA is at the same time particularly susceptible to the presence of noise, which significantly reduces its performance. Since the presence of noise can be difficult to distinguish from natural data complexity without a priori knowledge of the problem domain, this can pose a challenge for the practical applicability of the method.

Conclusions
In this paper we proposed a novel approach to handling data imbalance in a manner that preserves the shape of the original class distribution. The proposed approach was utilized in both over- and undersampling within a unified framework. Furthermore, we proposed a measure of imbalanced dataset complexity, which was later utilized to identify the areas of applicability of the proposed approach. The results of our experiments indicate that Potential Anchoring outperforms the considered state-of-the-art resampling strategies, in particular in combination with the SVM and MLP classifiers. This outperformance is strongest on naturally complex datasets. At the same time, however, PA was shown to be susceptible to the presence of noise.
A promising direction for further research is developing mechanisms for reducing the negative impact of noise on the algorithm's performance. This could include strategies for pre- and post-processing the data to remove suspicious observations, both original and generated by PA.
Alternatively, a modification of the proposed potential resemblance function could also be utilized. Finally, another research direction worth considering is translating the approach to the big data domain: since the approach uses gradient-based algorithms for the optimization, it is feasible to conduct the optimization in batch mode. Further research on how this would affect the performance is, however, necessary.