Radial-Based Undersampling for Imbalanced Data Classification

Data imbalance remains one of the most widespread problems affecting contemporary machine learning. The negative effect data imbalance can have on traditional learning algorithms is most severe in combination with other dataset difficulty factors, such as small disjuncts, the presence of outliers and an insufficient number of training observations. These difficulty factors can also limit the applicability of some of the methods of dealing with data imbalance, in particular the neighborhood-based oversampling algorithms derived from SMOTE. Radial-Based Oversampling (RBO) was previously proposed to mitigate some of the limitations of the neighborhood-based methods. In this paper we examine the possibility of utilizing the concept of mutual class potential, used to guide the oversampling process in RBO, in the undersampling procedure. The conducted computational complexity analysis indicates a significantly reduced time complexity of the proposed Radial-Based Undersampling algorithm, and the results of the performed experimental study indicate its usefulness, especially on difficult datasets.


Introduction
The problem of data imbalance [1,2,3] occurs in the classification task whenever the number of observations belonging to one of the classes, the majority class, exceeds the number of observations belonging to one of the other classes, the minority class. Traditional classification algorithms are susceptible to the presence of imbalanced data and tend to display a bias towards the majority class at the expense of the capability to discriminate the minority class. This negative effect on the classification performance is further exacerbated by the presence of additional dataset difficulty factors, such as small disjuncts [4] or an insufficient number of training observations [5], that can lead to model overfitting.
Most real datasets exhibit some degree of imbalance that can influence the classification process. Data imbalance heavily impacts many practical domains, such as cancer malignancy grading [6,7], industrial systems monitoring [8], fraud detection [9], behavioral analysis [10] and cheminformatics [11]. As a result, imbalanced data classification remains an active area of research [12,13,14,15]. Numerous approaches to mitigating the negative impact of data imbalance have been proposed in the literature. In particular, a family of data-level methods can be distinguished. Data-level methods manipulate the training data to make it more suitable for classification by traditional learning algorithms, either by increasing the number of minority observations (oversampling) or reducing the number of majority observations (undersampling).
Perhaps the most widespread paradigm of imbalanced data resampling are the neighborhood-based algorithms based on SMOTE [16]. However, SMOTE and many of its derivatives are susceptible to the presence of difficulty factors such as small disjuncts, outliers and small number of minority observations. Recently, a novel method based on the concept of mutual class potential, Radial-Based Oversampling [17], has been proposed with the goal of avoiding some of the pitfalls of SMOTE.
In this paper we investigate the possibility of extending the concept of mutual class potential to the undersampling procedure, with the aim of preserving some of the performance gains offered by using the potential to guide the resampling, while simultaneously reducing the computational complexity of the algorithm. The rest of this paper is organized as follows. In Section 2 we present the related work on the neighborhood-based oversampling algorithms, guided undersampling techniques, and the categorization of the minority object types. In Section 3 we describe the proposed method and discuss its computational complexity. In Section 4 we describe the conducted experimental study and the observed results. Finally, in Section 5 we present our conclusions.

Related Work
In this section we discuss the research relevant to the approach proposed in this paper. We begin with a brief description of the most prevalent paradigm of imbalanced data oversampling, the neighborhood-based methods, and highlight their shortcomings which originally inspired Radial-Based Oversampling. Afterwards we describe the related research on guided undersampling strategies and briefly outline the most notable algorithms. Finally, we summarize the existing research on minority object type categorization, used later in this paper to identify the areas of applicability of the proposed method.

Neighborhood-based oversampling algorithms
The most fundamental choice during the design of both oversampling and undersampling algorithms for handling data imbalance is the question of defining the regions of interest: the areas in which either the new instances are to be placed, in the case of oversampling, or from which the existing instances are to be removed, in the case of undersampling. Besides the random approaches, probably the most prevalent paradigm for oversampling are the neighborhood-based methods originating from the Synthetic Minority Oversampling Technique (SMOTE) [16]. The regions of interest of SMOTE are located between any given minority observation and its closest minority neighbors: SMOTE synthesizes new instances by interpolating between the observation and one of its nearest neighbors, selected at random. SMOTE can be considered a cornerstone for the majority of the existing oversampling strategies [18,19]. Numerous extensions of the original algorithm were proposed, with the most notable including: Borderline-SMOTE [20], focusing on the borderline instances, placed close to the decision border; Adaptive Synthetic Sampling (ADASYN) [21], individually adjusting the oversampling ratio based on the difficulty of the given observation; and Safe-Level-SMOTE [22] and LN-SMOTE [23], limiting the risk of placing synthetic instances inside the regions belonging to the majority class. However, despite their prevalence, neighborhood-based approaches have their own shortcomings that can affect the suitability of the synthesized observations for improving classification. Most importantly, in its basic variant SMOTE does not utilize the information about the distribution of the majority class objects: the regions of interest are based solely on the position of the minority observations.
This can potentially lead to synthesizing minority observations overlapping the clusters of majority observations for datasets displaying factors such as a small number of minority objects, disjoint data distributions, or the presence of outliers, as illustrated in Figure 1. Classifiers trained on datasets resampled in that way can display a possibly unjustified bias towards the minority class and, as a result, decreased performance. While some attempts at limiting the described deficiency of SMOTE have been made, such as the previously mentioned extensions, Safe-Level-SMOTE and LN-SMOTE, or combining oversampling using SMOTE with subsequent cleaning with either Tomek links [24] or the Edited Nearest-Neighbor rule [25], further research into methods explicitly using the information about the distribution of both classes is required.

Figure 1: An example of a dataset difficult for neighborhood-based methods, displaying factors such as a small number of minority objects, disjoint data distributions, or the presence of outliers. On the left: original data distribution. On the right: dataset after oversampling with SMOTE, with the generated observations highlighted.
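To make the shortcoming concrete, the interpolation step at the core of SMOTE can be sketched as follows. This is a minimal illustration, not the reference implementation; the `smote_sketch` name and loop structure are our own. Note that only the minority class is ever consulted, which is exactly the limitation discussed above.

```python
import numpy as np

def smote_sketch(minority, n_synthetic, k=5, rng=None):
    """Minimal sketch of SMOTE's interpolation step (illustrative names).

    For each synthetic sample: pick a random minority point, find its k
    nearest minority neighbors, and interpolate between the point and one
    randomly chosen neighbor.
    """
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    # Pairwise distances within the minority class only -- the majority
    # class distribution is ignored entirely.
    d = np.linalg.norm(minority[:, None] - minority[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(minority))
        j = rng.choice(neighbors[i])
        step = rng.random()  # interpolation coefficient in [0, 1)
        synthetic.append(minority[i] + step * (minority[j] - minority[i]))
    return np.array(synthetic)
```

Because every synthetic point is a convex combination of two minority observations, a line segment between two disjoint minority clusters may cut straight through a majority cluster, producing the overlap shown in Figure 1.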

Guided undersampling strategies
Similar to the case of oversampling, finding the regions of interest, which in the case of undersampling indicate which observations are to be discarded, is an essential choice in the algorithm design process. Besides the random methods, over the years a number of guided undersampling strategies have been proposed. Many of them rely on some sort of mechanism for identifying the least informative instances, either due to a high redundancy of the given observation or a low confidence that it is not an outlier.
One of the oldest examples of the latter are the cleaning strategies, heuristic algorithms used to remove observations deemed inconsistent with the remainder of the data: Tomek links [24], the Edited Nearest-Neighbor rule [25], Condensed Nearest Neighbour editing (CNN) [26], and more recently the Near Miss method (NM) [27], constitute examples of that paradigm. Notably, these methods do not allow specifying the number of observations that should be discarded: instead, they remove all the observations meeting the cleaning rule, which can leave an undesired level of data imbalance. As a result, more recent methods tend to sort the majority observations according to a chosen criterion and allow an arbitrary level of balancing. For instance, Anand et al. [28] propose sorting the undersampled observations based on the weighted Euclidean distance from the positive samples. Smith et al. [29], in their study of instance-level data complexity, advocate using the instance hardness criterion, with the hardness estimated based on the certainty of the classifiers' predictions.
Another family of methods that can be distinguished are the cluster-based undersampling algorithms, notably the methods proposed by Yen and Lee [30], which use clustering to select the most representative subset of data. Finally, as originally demonstrated by Liu et al. [31], undersampling algorithms are well suited for forming classifier ensembles, an idea that was further extended in the form of evolutionary undersampling [32] and boosting [33].

Categorization of minority object types
Despite the abundance of different strategies of dealing with data imbalance, it often remains unclear under what conditions a given method is expected to guarantee a satisfactory performance. Furthermore, taking into account the no free lunch theorem [34], it is unreasonable to expect that any single method will be able to achieve state-of-the-art performance on every provided dataset. Identifying the areas of applicability, that is the conditions under which the method is more likely to achieve good performance, is therefore desirable both from the point of view of a practitioner, who can use that information to narrow down the range of methods appropriate for the problem at hand, and a theoretician, who can use that insight in the process of developing novel methods.
In the context of imbalanced data classification, one of the criteria that can influence the applicability of different resampling strategies are the characteristics of the minority class distribution. Napierała and Stefanowski [35] proposed a method of categorizing different types of minority objects that captures these characteristics. Their approach uses a 5-neighborhood to identify the nearest neighbors of a given object, and afterwards assigns to it a category based on the proportion of neighbors from the same class: safe in the case of 4 or 5 neighbors from the same class, borderline in the case of 2 or 3 neighbors, rare in the case of 1 neighbor, and outlier when there are no neighbors from the same class. The percentage of the minority objects from different categories can then be used to describe the character of the entire dataset: examples of datasets with a large proportion of different minority object types are presented in Figure 6. Note that the imbalance ratio of the dataset does not determine the type of the minority objects it consists of, which is demonstrated in that example.
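The categorization described above can be sketched in a few lines. This is a minimal illustration assuming Euclidean distance; the function name and brute-force neighbor search are our own, not part of the cited work.

```python
import numpy as np

def minority_types(X, y, minority_label=1):
    """Sketch of the 5-neighborhood minority object categorization.

    Each minority object is labeled by the number of its 5 nearest
    neighbors that also belong to the minority class:
    4-5 -> safe, 2-3 -> borderline, 1 -> rare, 0 -> outlier.
    """
    X, y = np.asarray(X, float), np.asarray(y)
    names = {5: "safe", 4: "safe", 3: "borderline", 2: "borderline",
             1: "rare", 0: "outlier"}
    types = []
    for i in np.where(y == minority_label)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf  # exclude the object itself from its neighborhood
        nn = np.argsort(d)[:5]
        same = int((y[nn] == minority_label).sum())
        types.append(names[same])
    return types
```

The proportions of the four labels over all minority objects then characterize the dataset as a whole, as used later in the experimental study.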

Radial-Based Undersampling
In this section we describe the proposed Radial-Based Undersampling algorithm. We begin with a description of the potential estimation using radial basis functions, and the previously introduced Radial-Based Oversampling algorithm. Afterwards, we describe how the concept of mutual class potential can be applied during the undersampling of the majority objects. Finally, we discuss the computational complexity of the proposed algorithm.

Potential estimation and Radial-Based Oversampling
In the context of imbalanced data resampling, the concept of class potential was first introduced as an approach for designating the regions of interest for oversampling [17]. Specifically, it was proposed as an alternative to the regions of interest used by SMOTE and its derivatives, which do not utilize the information about the majority class distribution. Mutual class potential is a real-valued function the value of which, at any given point in space, represents the degree of affiliation of that point to either the majority or the minority class. To calculate that potential, we assign a Gaussian radial basis function (RBF) to every observation in the considered dataset, with the polarity dependent on its class. We assume the convention of assigning a positive polarity to the majority class observations and a negative polarity to the minority class observations. More formally, given a set of majority observations K, a set of minority observations κ, and a parameter γ affecting the spread of a single RBF, we define the mutual class potential at a point x as

Φ(x, K, κ, γ) = \sum_{i=1}^{|K|} e^{-\left(\frac{\lVert K_i - x \rVert}{\gamma}\right)^2} - \sum_{j=1}^{|κ|} e^{-\left(\frac{\lVert κ_j - x \rVert}{\gamma}\right)^2},    (1)

where K_i denotes the i-th object from the majority class and κ_j denotes the j-th object from the minority class, respectively.
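The potential defined above translates directly into code. The sketch below is an illustrative helper, not the authors' implementation; it assumes Euclidean distance and follows the stated sign convention (positive terms for majority objects, negative for minority ones).

```python
import numpy as np

def mutual_class_potential(x, majority, minority, gamma):
    """Mutual class potential at point x: the sum of Gaussian RBF terms
    over majority objects minus the sum over minority objects."""
    d_maj = np.linalg.norm(np.asarray(majority, float) - x, axis=1)
    d_min = np.linalg.norm(np.asarray(minority, float) - x, axis=1)
    return (np.exp(-(d_maj / gamma) ** 2).sum()
            - np.exp(-(d_min / gamma) ** 2).sum())
```

By construction the potential is positive in regions dominated by majority observations, negative in regions dominated by minority observations, and close to zero where the two influences balance out.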
Mutual class potential was used in the Radial-Based Oversampling algorithm, in which an iterative optimization was harnessed to locate the regions minimizing the absolute value of the potential. Intuitively, such regions represent a low certainty of affiliation to either of the classes. New synthetic observations were generated in such regions with the aim of reducing the classifier's bias towards the majority class and moving the decision border in favor of the minority class. Compared with SMOTE, this approach displayed some beneficial characteristics. RBO tended to be less affected by the presence of outliers, as well as by a small number of minority objects combined with disjoint distributions. While in those cases SMOTE was likely to generate new instances overlapping the clusters of existing majority objects, using RBO resulted in smaller, constrained regions of negative class potential in which new instances were synthesized.
The algorithm was afterwards extended to the problem of classification of noisy imbalanced data [36] and multi-class imbalanced data [37]. Furthermore, an extension omitting observations categorized as outliers was also proposed by Bobowska and Woźniak [38]. Despite leading to a favorable performance in the conducted experiments, especially in the multi-class setting, using RBO was computationally expensive due to the need to recalculate the class potential at every optimization step. Reducing the computational overhead of the algorithm was identified as essential to make it applicable to very large datasets.

Using mutual class potential during undersampling
While originally proposed to provide the regions of interest in the process of oversampling, mutual class potential can also easily be used to guide the process of undersampling the majority class. Recall that, using the assumed convention, a high value of mutual class potential at a given point in space indicates that in its proximity there is a higher concentration of majority than minority observations. It is therefore possible to rank the existing majority observations based on their mutual class potential. We propose using this ranking mechanism to determine the order of undersampling. Specifically, we make the assumption that the majority observations with the highest corresponding mutual class potential provide the least amount of information and are more redundant than the observations with lower potential. As a result, we undersample in the order of decreasing potential, updating its value for the remaining observations after each undersampled object.
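The procedure described above can be sketched as follows. This is a simplified illustration under the assumptions of Euclidean distance and the Gaussian RBF from Equation 1; the `rbu_sketch` name and the incremental potential update are our own rendering, not the authors' code.

```python
import numpy as np

def rbu_sketch(majority, minority, gamma=0.1, ratio=1.0):
    """Sketch of potential-guided undersampling: repeatedly discard the
    majority object with the highest mutual class potential, updating the
    potentials of the remaining objects after each removal. A ratio of
    1.0 removes the full majority surplus (balanced class distribution)."""
    K = np.asarray(majority, float).copy()
    kappa = np.asarray(minority, float)

    def rbf_sum(points, x):
        return np.exp(-(np.linalg.norm(points - x, axis=1) / gamma) ** 2).sum()

    # initial mutual class potential of every majority object
    phi = np.array([rbf_sum(K, x) - rbf_sum(kappa, x) for x in K])

    n_remove = int(ratio * (len(K) - len(kappa)))
    for _ in range(max(n_remove, 0)):
        i = int(np.argmax(phi))  # the most redundant remaining object
        removed = K[i]
        K = np.delete(K, i, axis=0)
        phi = np.delete(phi, i)
        # removing one object lowers each remaining potential by exactly
        # that object's RBF contribution, so no full recomputation is needed
        phi -= np.exp(-(np.linalg.norm(K - removed, axis=1) / gamma) ** 2)
    return K
```

The incremental update in the loop body is what distinguishes RBU from the repeated full potential evaluations of RBO, and underlies the complexity reduction analyzed later in this section.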
We present the pseudocode of the proposed method in Algorithm 1. In addition to the collection of majority objects K and the collection of minority objects κ, the algorithm has two parameters: the spread of the individual radial basis function γ, affecting the range of impact of the associated observation on the mutual class potential, and the undersampling ratio, with a ratio equal to 1.0 indicating that the majority objects are undersampled up to the point of achieving a balanced class distribution. Furthermore, we present a visualization of the algorithm's behavior for different values of γ in Figure 3. As can be seen, the value of the γ parameter significantly impacts the shape of the resulting potential: using smaller values of γ leads to a more complex potential field, affected at a given point in space only by the observations in its close proximity, whereas using larger values of γ leads to a smooth potential. As a result, the choice of γ affects the order of undersampling. For smaller values of γ the removed observations are mostly part of local clusters consisting of several majority and no minority observations, and these clusters are never completely removed. Furthermore, individual majority observations and majority observations located in close proximity of minority observations tend to remain unaffected. On the other hand, for larger values of γ a single cluster with a high concentration of majority objects is identified and the undersampling is performed solely within its bounds. When combined with a significant data imbalance this can lead to the potentially undesirable behavior of very highly centralized undersampling, which may indicate a proclivity towards using lower values of γ.
Algorithm 1 Radial-Based Undersampling

Input: collections of majority objects K and minority objects κ
Parameters: spread of radial basis function γ, undersampling ratio
Output: undersampled collection of majority objects K′

function RBU(K, κ, γ, ratio):
    K′ ← K
    for every majority object K_i in K′ do
        compute its potential Φ_i
    end for
    while the number of discarded objects is lower than ratio · (|K| − |κ|) do
        x ← majority object K_i from K′ with the highest potential Φ_i; in case of multiple such objects break ties arbitrarily
        discard x from K′
        for every majority object K_i in K′ do
            update its potential Φ_i
        end for
    end while
    return K′

Computational complexity analysis
Let us denote the total number of observations by n, the number of majority and minority observations by n_maj and n_min, respectively, and the number of features by m. Let us consider the computational complexity of the RBU algorithm applied up to the point of achieving a balanced class distribution. A single calculation of the mutual class potential at any given point, as defined in Equation 1, requires n distance calculations, each with a complexity of O(m), as well as n summations and n radial basis function evaluations, both with a complexity of O(1). Therefore, the total complexity of a single mutual class potential calculation is equal to O(mn). Furthermore, the procedure of removing a single observation consists of finding the observation with the highest potential in a collection of size not exceeding n_maj, with a complexity equal to O(n_maj); discarding it from said collection, the complexity of which we will assume to be O(n_maj); and updating all of the remaining potentials, consisting of n_maj distance calculations, subtractions and radial basis function evaluations, leading to an overall complexity of the potential update operation equal to O(m·n_maj). Combining the above, the complexity of the procedure of removing a single observation is also equal to O(m·n_maj). The complete RBU algorithm consists of the initial calculation of the potential for every majority observation, with a complexity of O(m·n·n_maj), and the removal of n_maj − n_min observations, with a complexity of O(m·n_maj·(n_maj − n_min)). As a result, the total complexity of the proposed RBU algorithm can be simplified to O(mn²). For comparison, the complexity of the original Radial-Based Oversampling algorithm applied up to the point of achieving a balanced class distribution, as discussed in [37], is equal to O(imn²), with i denoting the number of the algorithm's iterations, the value of which was experimentally chosen to be in the range from 1000 to 8000 in the conducted experiments.
As a result, RBU has a significantly reduced computational overhead when compared to the original RBO algorithm.

Experimental Study
To empirically evaluate the applicability of the proposed Radial-Based Undersampling algorithm we conducted a two-part experimental study. In its first stage we examined the impact of the algorithm's parameters on its performance. In the second stage we compared the algorithm with selected state-of-the-art resampling strategies. Finally, we analysed the parameters of the datasets on which the proposed method achieved the best results to identify possible areas of applicability. In the remainder of this section we describe our experimental set-up and present the observed results.

Set-up
Data. The conducted experimental study was based on the binary imbalanced datasets provided in the KEEL repository [39]. Specifically, from the available datasets we excluded the ones containing fewer than 12 minority observations, to avoid issues with cross-validation, as well as the ones for which an AUC greater than 0.85 was achieved with SVM without any resampling, to eliminate the datasets for which, despite data imbalance, resampling was not required. A total of 50 datasets was selected using this approach. Out of them, 20 were randomly chosen for the preliminary analysis, during which the impact of the parameters on the algorithm's performance was examined. The remaining 30 datasets were used during the comparison with the reference methods.
The details of the used datasets are presented in Table 1. In addition to the imbalance ratio (IR), the number of samples and the number of features, for each dataset we computed the proportion of different types of minority class observations, as proposed by Napierała and Stefanowski [35]. Specifically, the types were identified using a 5-neighbourhood computed based on the Minkowski metric.
Prior to resampling and classification, categorical features were encoded as integers. Afterwards, all features were standardized by removing the mean and scaling to unit variance. No further preprocessing was applied.
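The preprocessing described above can be sketched as follows. This is a minimal stand-in for the library transformers actually used in the study; the `preprocess` helper and its column-list interface are illustrative assumptions.

```python
import numpy as np

def preprocess(columns):
    """Sketch of the preprocessing step: categorical columns are
    integer-encoded, then every column is standardized to zero mean and
    unit variance. `columns` is a list of 1-D arrays, one per feature."""
    processed = []
    for col in columns:
        col = np.asarray(col)
        if not np.issubdtype(col.dtype, np.number):
            # encode categories as integers (order given by np.unique)
            _, col = np.unique(col, return_inverse=True)
        col = col.astype(float)
        std = col.std()
        # guard against constant columns to avoid division by zero
        processed.append((col - col.mean()) / (std if std > 0 else 1.0))
    return np.column_stack(processed)
```

Standardization matters here because both the distance-based reference methods and the RBF potential are sensitive to feature scales.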
Classification. Four different classification algorithms, representing different learning paradigms, were used throughout the experimental study. Specifically, we used CART decision tree, k-nearest neighbors classifier (KNN), naive Bayes classifier (NB) and support vector machine with RBF kernel (SVM). The implementations of the classification algorithms provided in the scikit-learn machine learning library [40] were used, and their default parameters remained unchanged.
With the exception of RBO, the implementations of the reference methods provided in the imbalanced-learn library [44] were used.
Evaluation. For every dataset we reported the results averaged over the 5 × 2 cross-validation folds [45]. Throughout the experimental study we reported the values of precision, recall, F-measure, AUC and G-mean. Whenever applicable, parameter selection was conducted by further 3 × 2 cross-validation on the currently considered training data, with the optimization criterion being the average of F-measure, AUC and G-mean. It should be noted that in most cases observed F-measure was significantly lower than AUC and G-mean, leading to the choice of parameters biased towards the latter two metrics.
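For reference, two of the reported metrics can be computed directly from the binary confusion matrix as sketched below. The helper name is our own; the minority class is treated as the positive class (label 1), consistent with the imbalanced setting.

```python
import numpy as np

def gmean_and_f1(y_true, y_pred):
    """Sketch of G-mean and F-measure from the binary confusion matrix."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    # G-mean balances performance on both classes; F-measure balances
    # precision and recall on the positive (minority) class
    gmean = np.sqrt(recall * specificity)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return gmean, f1
```

Because the F-measure depends on precision, which suffers heavily under aggressive undersampling, it typically takes lower values than G-mean on imbalanced data, which explains the bias of the averaged optimization criterion noted above.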
Implementation and reproducibility. The experiments described in this paper were implemented in the Python programming language. Complete code, sufficient to repeat the experiments, was made publicly available at https://github.com/michalkoziarski/RBU. In addition to the code we also provide the cross-validation folds used during the experiments, as well as a file containing the complete results, enabling any further analysis.

Analysis of the impact of parameters
In the first stage of the conducted experimental study we considered the impact of two of the proposed algorithm's parameters, the radial basis function spread γ and the undersampling ratio, on the algorithm's performance. Specifically, we conducted two experiments: in the first one we adjusted the value of the γ parameter in {0.001, 0.01, 0.1, 1.0, 10.0, 100.0} while selecting the values of the undersampling ratio individually for each dataset using cross-validation, with the considered values in {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}. In the second experiment we used the same parameter values, but adjusted the undersampling ratio while selecting γ with cross-validation.
The results, averaged over 20 datasets, were presented in Figure 4. As can be seen, the best performance with respect to the combined metrics (F-measure, AUC, G-mean) was observed for smaller values of γ, equal to 0.1 or lower.
Using higher values either did not improve the average performance, or resulted in its significant decline. The exact value of the γ parameter for which the best averaged performance was observed depended on the type of classifier and the chosen metric. For the CART, KNN and SVM classifiers decreasing the value of γ tended to improve the recall at the cost of precision, whereas for NB the reverse was observed. It is worth noting that while the described trends were roughly monotonic for individual classifier and metric combinations, a significant peak in precision, combined with a drop in recall, was observed in the case of SVM for γ = 1.0. During the examination of the results for individual datasets it was confirmed that this peak occurred in about half of the datasets. It is not clear what caused the peak and what its significance is.
In the case of the undersampling ratio, as can be expected, increasing the ratio led to an improvement in the classifiers' recall and a corresponding drop in precision, for all of the examined classification algorithms. When considering the combined metrics, the relation between the performance of the algorithm and the undersampling ratio varied depending on the metric. In the case of the G-mean, a significantly better performance was observed for the highest value of the undersampling ratio, corresponding to undersampling up to the point of achieving a balanced class distribution. A noticeable improvement in performance was also observed in the case of AUC in combination with the KNN and SVM classifiers. When combined with the CART and NB classifiers, the undersampling ratio did not, on average, have a significant impact on the observed AUC. In the case of the F-measure the peak in performance was observed for different undersampling ratios. In the case of SVM high, but not complete, undersampling led to achieving the best results, with the best performance observed for the undersampling ratio of 0.8. In the case of KNN and NB the best results were achieved for medium ratios of 0.4 and 0.6, but in the case of NB the impact of the choice of the undersampling ratio was less significant. Finally, in the case of CART the best performance was observed for small or no undersampling, with ratios equal to 0.0 and 0.2. Notably, in no case was the best performance with regard to F-measure observed for complete undersampling, up to the point of achieving a balanced class distribution.
To summarize, regardless of the choice of the classification algorithm and the evaluation metric, the best performance was observed for smaller values of the γ parameter in {0.001, 0.01, 0.1}, any of which could be used as a sensible default value. The choice of the undersampling ratio, however, was dependent on the classification algorithm and evaluation metric. While using complete undersampling was a sensible default with regard to AUC and G-mean, lower undersampling ratios led to better performance with regard to F-measure, and the choice of the classifier affected the exact value of the undersampling ratio for which the best performance was observed. It is worth noting that the optimization of parameters with respect to one of the metrics could lead to suboptimal results with respect to the others. In particular, when using the scheme employed in this paper, that is choosing the parameters maximizing the average value of F-measure, AUC and G-mean, the parameter choice will be biased towards AUC and G-mean: this is both because the F-measure tends to take lower values, and because AUC and G-mean displayed a higher correlation with each other than with the F-measure.

Comparison with other methods
In the second stage of the conducted experimental analysis we compared the proposed Radial-Based Undersampling with a total of 17 reference data-level methods. We present the average rankings achieved by the respective methods for all of the performance metrics and highlight the statistically significantly different results in Table 2. Furthermore, for the NB classifier we present a detailed win-tie-loss analysis, that is a visualization of the number of datasets on which RBU outperforms the individual reference methods, in Figure 5. As can be seen, in the general case the usefulness of the proposed RBU algorithm, when compared to the reference methods, was reliant on both the choice of the classification algorithm and the metric used to evaluate the performance. RBU achieved the best results when combined with the NB classifier, scoring the highest average ranks with respect to all of the combined performance metrics, and statistically significantly better results with respect to at least one of them for 10 out of 17 reference methods. Furthermore, it achieved a relatively good performance when combined with CART and SVM, scoring statistically significantly better results with respect to at least one of the combined metrics: in 6 cases for CART, and in 7 cases for SVM. At the same time, for both classifiers statistically significantly worse results were observed in only a single case: compared to ENN for the CART classifier, and compared to SMOTE for SVM, both with respect to F-measure. The worst performance was observed when RBU was combined with the KNN classifier, in which case all of the variants of SMOTE, as well as the NCL algorithm, achieved statistically significantly better results than RBU. In that case, the latter significantly outperformed only two of the reference methods.
When compared to RBO, RBU achieved statistically significantly different results only in the case of the CART decision tree. In that instance RBU achieved a significantly better AUC and G-mean than RBO. Out of the remaining cases the highest disproportion in average ranks was observed in combination with SVM, this time in favor of RBO, but the results were not statistically significant.
It is worth noting that for CART, KNN and SVM classifiers RBU achieved a higher rank with respect to recall than the rank achieved with respect to precision, whereas the opposite was true for NB. This indicates that undersampling with RBU affects NB classification differently than the remaining classifiers, and has less severe impact on the precision of that classifier.
To summarize, while the proposed RBU algorithm did not, in the general case, achieve the best results in combination with all of the considered classification algorithms, it performed best when combined with the NB classifier, and to a lesser extent with CART and SVM. The areas of applicability with respect to the choice of the classification algorithm partially overlap for RBU and RBO: both algorithms displayed comparatively good performance when combined with the NB classifier, but RBU scored significantly better results when combined with CART. Finally, for the CART, KNN and SVM classifiers RBU achieved comparatively better recall than precision, but the opposite was true for the NB classifier.

Analysis of the impact of dataset characteristics
In the final stage of the conducted experimental study we examined whether the performance of the algorithm changes depending on the characteristics of the dataset on which it is applied. Specifically, we used the categorization proposed by Napierała and Stefanowski [35] to evaluate the fraction of minority objects belonging to each of the categories: safe, borderline, rare and outlier, for each individual dataset. Afterwards, we examined the relation between the fraction of objects of a given type and the rank the RBU method achieved compared to the reference algorithms. In Table 3 we present the Pearson correlation coefficient between the percentage of the objects of a given type and the rank obtained for that dataset, with statistically significant correlations at the significance level α = 0.10 highlighted. Furthermore, in Figure 6 we present scatterplots containing the individual data points with a linear regression model fit. Note that the ranking was performed in a descending order, meaning that the best performing method received a rank equal to 1. Therefore, a negative correlation and regression slope indicate that the relative performance of the method improves.
Similarly to the results observed during the comparison with the reference methods, the trends observed for the CART, KNN and SVM classifiers differed from those observed for the NB classifier. For all three of the former classifiers, a statistically significantly worse recall, compared to the reference methods, was observed when the datasets contained a high proportion of safe objects. Conversely, a statistically significantly better recall was observed for the datasets containing a higher proportion of rare and outlier minority objects. This, in turn, led to significantly higher values of AUC and G-mean for the datasets containing a larger proportion of rare and outlier minority objects, and significantly lower values of these metrics for datasets containing a large proportion of safe objects. In the case of the NB classifier, on the other hand, significantly worse results, compared with the reference methods, were observed with regard to AUC and G-mean for datasets with a large proportion of outliers, and significantly better results with regard to AUC for datasets with a large proportion of borderline minority objects. However, no significant relations with regard to precision or recall were observed.

To summarize, the results of the analysis of the impact of dataset characteristics indicate that the proposed RBU algorithm, when used with the CART, KNN or SVM classifier, is particularly well suited for resampling datasets with a high proportion of rare and outlier minority objects, but achieves comparatively worse performance on safe datasets. This is caused mainly by the differences in classification recall. However, these trends do not extend to the case of the NB classifier, for which the observed performance with regard to AUC and G-mean was actually worse when datasets contained a high proportion of outliers.

Conclusions
In this paper we proposed a novel undersampling algorithm, Radial-Based Undersampling, based on the previously introduced concept of mutual class potential. The main motivation behind the proposed algorithm was extending the notion of non-nearest-neighbor-based resampling, previously used in Radial-Based Oversampling, to the undersampling procedure. The proposed method offers a conceptually simple and computationally more efficient alternative to the Radial-Based Oversampling algorithm. In the conducted experimental study we empirically evaluated the usefulness of the proposed method and identified its areas of applicability. Specifically, the observed results indicate the suitability of the algorithm for use in combination with the naive Bayes classifier and, to a lesser extent, the CART decision tree and the support vector machine. Compared to Radial-Based Oversampling, RBU displayed statistically significantly better performance when combined with the CART decision tree. Furthermore, we analysed the behavior of the proposed algorithm with respect to the characteristics of the datasets on which it was applied. For the majority of the examined classification algorithms the proposed method achieved comparatively better results when used on difficult datasets, containing a higher proportion of rare and outlier minority instances.
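As a reminder of the core concept, mutual-class-potential-guided undersampling can be sketched as follows. This is a simplified illustration, not a faithful reproduction of the RBU procedure: it assumes a Gaussian RBF with an illustrative spread γ = 0.25, and its greedy criterion of removing the majority observation with the highest potential until the classes are balanced stands in for the exact selection rule described in the paper (all identifiers are ours):

```python
import numpy as np

def mutual_class_potential(point, majority, minority, gamma=0.25):
    """Mutual class potential at `point`: the difference between the summed
    Gaussian RBF contributions of the majority and minority observations."""
    maj = np.exp(-(np.linalg.norm(majority - point, axis=1) / gamma) ** 2).sum()
    mino = np.exp(-(np.linalg.norm(minority - point, axis=1) / gamma) ** 2).sum()
    return maj - mino

def radial_based_undersampling(X, y, minority_label, gamma=0.25):
    """Illustrative greedy undersampling: repeatedly remove the majority
    observation with the highest mutual class potential until the class
    sizes are equal. A simplification of the actual RBU criterion."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    majority = list(np.where(y != minority_label)[0])
    min_points = X[y == minority_label]
    n_minority = len(min_points)
    while len(majority) > n_minority:
        maj_points = X[majority]
        potentials = [mutual_class_potential(X[i], maj_points, min_points, gamma)
                      for i in majority]
        # drop the majority observation lying deepest inside the majority region
        majority.pop(int(np.argmax(potentials)))
    keep = sorted(majority + list(np.where(y == minority_label)[0]))
    return X[keep], y[keep]
```

Because each iteration only evaluates the potential at existing majority observations, no artificial samples are generated, which is the source of the reduced computational cost relative to RBO noted above.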
Despite the relative simplicity of the proposed undersampling selection criterion, the observed results are encouraging for further development of the algorithm. Specifically, we intend to explore the possibility of using other selection criteria based on the idea of mutual class potential, with a particular focus on more theoretically motivated choices.

Figure 6: Scatterplots representing the relation between the percentage of objects of a given type: safe (blue), borderline (orange), rare (green) and outlier (red), and the rank achieved by RBU on the given dataset, with regard to F-measure (left grid), AUC (middle grid) and G-mean (right grid). Each row contains data for a single classifier, from the top: CART, KNN, NB and SVM. 95% confidence intervals are shown.