Probability Density Machine: A New Solution of Class Imbalance Learning

Class imbalance learning (CIL) is an important branch of machine learning: classification models generally find it difficult to learn from imbalanced data, yet skewed data distributions arise frequently in real-world applications. In this paper, we introduce a novel CIL solution called the Probability Density Machine (PDM). First, in the context of the Gaussian Naive Bayes (GNB) predictive model, we analyze theoretically why an imbalanced data distribution degrades the performance of a predictive model, and we conclude that the impact of class imbalance is associated only with the prior probability, not with the conditional probability of the training data. Then, in the same context, we show the rationality of several traditional CIL techniques. Furthermore, we indicate the drawback of combining GNB with these traditional CIL techniques. Next, profiting from the idea of K-nearest neighbors probability density estimation (KNN-PDE), we propose the PDM, an improved GNB-based CIL algorithm. Finally, experiments on a large number of class-imbalanced data sets show that the proposed PDM algorithm yields promising results.


Introduction
Learning from imbalanced data is an important and active topic in machine learning, as it has been widely applied to diagnose and classify diseases [1,2], detect software defects [3,4], analyze biology and pharmacology data [5,6], evaluate credit risk [7], predict actionable revenue change and bankruptcy [8,9], diagnose faults in industrial processes [10,11], classify soil types [12,13], and even predict crash injury severity [14] or analyze crime linkages [15]. Meanwhile, class imbalance learning (CIL) is also a challenging task. It is known that most existing supervised learning models are constructed on the theory of empirical risk minimization; thus, they tend to favor the majority classes while neglecting performance on the minority classes [16,17].
Several previous works have explored, from different aspects, why imbalanced data hurt traditional predictive models [34,35,43,44]. Most viewpoints hold that the combination of a skewed density distribution and the empirical risk minimization learning rule causes the degradation of modeling quality. In fact, for imbalanced data, the density distributions of different classes often vary sharply, whereas their conditional probability densities are generally invariant. This means that, if we could estimate the probability density distribution of each class accurately, it would become easier to solve the CIL problem. Motivated by this viewpoint, we develop a novel solution called the Probability Density Machine (PDM) for addressing the CIL problem in this study.
Firstly, to verify the validity of this idea, we theoretically analyze why an imbalanced data distribution hurts the performance of a predictive model, in the context of the Gaussian Naive Bayes (GNB) predictive model [45,46]. We show that the negative impact of class imbalance is associated only with the prior probability (density distribution) and is not related to the conditional probability (probability density distribution) of the training data, which verifies our argument. Then, within the same theoretical framework, we explain the rationality of several traditional CIL techniques, including data-level and algorithm-level methods. Next, we show that the conditional probability estimation in GNB is generally inaccurate, as the data might be far from Gaussian. Furthermore, we introduce a K-nearest-neighbors probability density estimation (KNN-PDE-) [47] alike algorithm to improve the accuracy of probability density estimation and hereby propose the PDM algorithm. Finally, the effectiveness and feasibility of the proposed PDM model are verified by extensive experiments. In comparison with several traditional CIL solutions, the PDM shows promising results. The rest of this paper is organized as follows. Section 2 briefly reviews related work on CIL techniques. Section 3 describes the CIL problem, analyzes why a class-imbalanced distribution hurts the performance of a predictive model in the context of GNB, and further explains the rationality of several traditional CIL solutions. In Section 4, we introduce the KNN-PDE-alike algorithm and then explain how to use it to develop the PDM algorithm. Section 5 presents the experimental results and provides the corresponding discussion. Finally, Section 6 concludes the paper and outlines future work.

Related Work
In the past two decades, several hundred different CIL methods have been proposed. These approaches can be roughly divided into three groups: data-level, algorithm-level, and ensemble learning. Next, we briefly review the work related to each of these three groups.
The data-level approach, which is also called sampling, addresses the CIL problem by rebalancing the data distribution. It either increases the number of minority instances (oversampling) [20-23], decreases the number of majority instances (undersampling) [24,25], or uses both strategies simultaneously (hybrid sampling) [26,27]. Random undersampling (RUS) [20] and random oversampling (ROS) [20] are the simplest sampling methods: the former randomly removes majority instances, while the latter randomly copies minority samples. Both methods have inherent drawbacks; i.e., RUS loses much information that is helpful for modeling, and ROS tends to make classification models overfit. The Synthetic Minority Oversampling Technique (SMOTE) [21] is an effective method to alleviate the problems brought by RUS and ROS. It generates synthetic minority instances between two adjacent real minority examples. However, SMOTE is inclined to propagate noise, which also degrades the quality of the training data. In recent years, several improved SMOTE algorithms have been proposed, including MWMOTE [22], which combines a clustering technique with SMOTE so that new instances are generated around hard-to-learn samples and do not emerge at wrong positions; MCT [23], which alters the class distribution of the training data by cloning each minority class instance according to its similarity to the mode of the minority class; SMOTE-RSB [26], which uses rough set theory as a cleaning tool to search for and remove noisy minority samples generated by SMOTE; and SMOTE-IPF [27], which adopts an ensemble-based filtering technique to prevent noise propagation. To avoid the information loss caused by RUS, Yu et al. [25] integrate ant colony optimization into the undersampling procedure to adaptively search for the most important majority examples. All in all, the data-level technique has two major merits: (1) it is easy to implement, as only the data are manipulated; (2) it is independent of the classification model.
Unlike data-level methods, algorithm-level approaches focus on modifying the classification model itself to make it adapt to class-imbalanced environments. Cost-sensitive learning and the threshold moving strategy are the most popular algorithm-level CIL methods. Cost-sensitive learning adapts to the imbalanced data distribution by assigning different misclassification penalties to instances belonging to different classes. The allocation of misclassification penalties might be either empirical [28,30] or personalized [29,31-33]. In addition, many supervised learning models can be made cost-sensitive, including neural networks [28], support vector machines [29,31-33], and Bayesian networks [30]. Threshold moving adopts a postprocessing strategy that forces the classification hyperplane trained by any traditional supervised learning model to move toward the area of the majority instances [34,35]. The moving distance can be determined either empirically or by an optimization technique.
Ensemble learning is not an independent strategy for addressing the CIL problem, but rather a solution for promoting the performance of other CIL methods. That is to say, ensemble learning must be combined with a data-level or algorithm-level method to solve the CIL problem. For example, RUSBoost integrates RUS into the boosting ensemble learning paradigm [36]; SMOTEBagging combines SMOTE with the bagging ensemble learning paradigm [37]; AdaCost [38] and the AdaC series of algorithms [39] organize cost-sensitive learning and boosting; and EnSVM-OTHR [34] puts the threshold moving strategy into bagging. Some recent studies also focus on class imbalance in high-dimensional environments [40] and on improving the diversity of base models in CIL ensembles [41,42]. Although ensemble learning always has a high time complexity, it has shown potential to improve the quality of CIL models in various real-world applications.

Gaussian Naive Bayes (GNB) Predictive Model.
As a variant of the Naive Bayes (NB) model, GNB has been widely used to solve various real-world supervised learning problems [45,46]. Although both NB and GNB are based on Bayes' theorem, there is an essential difference between them; i.e., NB can only model data with discrete features, while GNB is extended to deal with data in a continuous feature space. To model continuous features, GNB retains NB's feature-independence hypothesis but additionally assumes that, within each class, the instances follow an independent Gaussian distribution whose mean and variance can be estimated directly from the training data. Then, for any given instance x_i, it is easy to calculate its conditional probability P(x_i|y) in a class y, and the prior probability P(y) can be acquired by directly counting the proportion of instances belonging to class y in the training set. According to Bayes' theorem, the posterior probability P(y|x_i) can be calculated by

P(y|x_i) = P(y)P(x_i|y) / (P(y)P(x_i|y) + P(∼y)P(x_i|∼y)),   (1)

where P(∼y) and P(x_i|∼y) represent the prior probability of the class ∼y and the conditional probability of the instance x_i in the class ∼y, respectively. Then, GNB decides the class label of x_i by comparing P(y|x_i) and P(∼y|x_i) according to the following rule:

assign x_i to y if P(y|x_i) ≥ P(∼y|x_i), and to ∼y otherwise.   (2)

Why Class Imbalance Hurts GNB?
Next, in the context of GNB, we analyze why an imbalanced data distribution hurts the performance of traditional predictive models. Specifically, suppose we face a binary-class imbalanced classification task whose training set is Φ = {(x_i, y_i) | x_i ∈ R^n, y_i ∈ {Y_1, Y_2}, 1 ≤ i ≤ s}, where Y_1 and Y_2 denote the class labels of the majority and minority instances, respectively. According to the class labels, Φ can be further split into two subsets Φ_1 and Φ_2 containing s_1 and s_2 instances, where s = s_1 + s_2 and s_1 > s_2. Furthermore, to simplify the analysis, we restrict the feature space of Φ to one dimension and let both Y_1 and Y_2 satisfy Gaussian distributions. Then, we can directly acquire the prior probability and the conditional probability of each class. According to the Bayesian formula, the posterior probability of each class can be calculated as

P(Y_1|X) = P(Y_1)P(X|Y_1)/P(X),  P(Y_2|X) = P(Y_2)P(X|Y_2)/P(X).   (3)

Based on the assumptions above, Figure 1 shows why an imbalanced data distribution can hurt the performance of predictive models.
As shown in Figure 1, the classification boundary has been obviously pushed toward the minority class, owing to the fact that P(Y_1) is larger than P(Y_2). In the context of GNB, the classification boundary corresponds to the condition P(Y_1|X) = P(Y_2|X), i.e., P(Y_1)P(X|Y_1) = P(Y_2)P(X|Y_2); since P(Y_1) > P(Y_2), only when P(X|Y_1) < P(X|Y_2) can the corresponding X be selected as the classification boundary. Therefore, the negative impact of class imbalance on traditional supervised learning models is associated only with the difference of the prior probabilities (density distribution) of the classes, and is irrelevant to the conditional probability (probability density distribution) of each class in the training data. In other words, class-imbalanced data possess probability density invariance. Ignoring the impact of the prior probabilities, it is not difficult to deduce the ideal decision condition

decide Y_1 if P(X|Y_1) ≥ P(X|Y_2), and Y_2 otherwise.   (4)

Any CIL solution tries its best to eliminate, or at least alleviate, the influence of the prior probabilities, making the model totally satisfy or approximately conform to the condition in (4).
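To make the boundary shift concrete, the short snippet below (our own illustrative sketch, not from the paper; the means, variance, and imbalance ratio are assumed values) solves for the GNB boundary of two one-dimensional Gaussian classes under balanced and skewed priors.

```python
# A minimal numeric check of the analysis above: two 1-D Gaussian classes with
# equal variance. The GNB boundary solves P(Y1)P(X|Y1) = P(Y2)P(X|Y2); with
# balanced priors it sits midway between the means, while a skewed prior
# pushes it into the minority class's territory.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

mu1, mu2, sigma = 0.5, 0.7, 0.1   # Y1 = majority, Y2 = minority (assumed values)

def boundary(p1):
    """Return x where p1 * N(x; mu1, sigma) = (1 - p1) * N(x; mu2, sigma)."""
    f = lambda x: p1 * norm.pdf(x, mu1, sigma) - (1 - p1) * norm.pdf(x, mu2, sigma)
    return brentq(f, mu1, mu2 + 0.5)

print(round(boundary(0.5), 3))      # 0.6   : balanced priors -> midpoint of the means
print(round(boundary(10 / 11), 3))  # ~0.715: IR = 10 pushes the boundary past mu2
```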

The Explanation of the Rationality of Several Traditional CIL Techniques.
According to the theoretical analysis above, it is not difficult to explain the rationality of several traditional CIL techniques.
First, let us consider the data-level strategy. It solves the skewed distribution problem by increasing P(Y_2) (oversampling the minority class), decreasing P(Y_1) (undersampling the majority class), or doing both simultaneously, thereby making P(Y_1) = P(Y_2). Obviously, this strategy satisfies the condition in (4).
Next, cost-sensitive learning adjusts the prior probability distribution by designating different costs for different classes. That means (3) can be transformed to

P(Y_1|X) = C_1 P(Y_1)P(X|Y_1)/P(X),  P(Y_2|X) = C_2 P(Y_2)P(X|Y_2)/P(X),   (5)

where C_1 and C_2 denote the costs of the majority class and the minority class, respectively. Generally, the cost assignment relates to the class imbalance ratio (IR), i.e.,

C_2/C_1 = s_1/s_2 = P(Y_1)/P(Y_2),   (6)

so that C_1 P(Y_1) = C_2 P(Y_2), which again satisfies the condition in (4).
Finally, we investigate the rationality of the threshold moving technique. It adds a compensation threshold λ to the decision value of the minority class, i.e.,

decide Y_2 if P(Y_2|X) + λ ≥ P(Y_1|X), and Y_1 otherwise,   (7)

where λ is usually a positive value. Specifically, when λ = P(X|Y_2)(P(Y_1) − P(Y_2))/P(X), the rule exactly satisfies the condition in (4). Therefore, the rationality of various CIL techniques can be explained well within the theoretical framework of GNB, and they can all be seen as different strategies for balancing the prior probabilities, making (4) workable.
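As a quick numeric illustration (an assumed toy example, not from the paper), the following snippet checks that the cost-sensitive rule of (5)/(6) and the threshold-moving rule of (7) both recover the ideal condition (4) at a point where the plain GNB rule is misled by the priors.

```python
# Toy check: with skewed priors the plain posterior comparison favors Y1,
# while cost weighting and threshold moving both reproduce condition (4).
p1, p2 = 0.9, 0.1              # skewed priors (IR = 9), assumed values
px_y1, px_y2 = 0.2, 0.5        # conditional densities at some point X
px = p1 * px_y1 + p2 * px_y2   # evidence P(X)

plain = p1 * px_y1 > p2 * px_y2                      # biased GNB rule -> Y1
cost = 1.0 * p1 * px_y1 > 9.0 * p2 * px_y2           # C2/C1 = IR, as in (5)/(6)
lam = px_y2 * (p1 - p2) / px                         # the threshold of (7)
shifted = p1 * px_y1 / px > p2 * px_y2 / px + lam    # threshold moving
ideal = px_y1 > px_y2                                # condition (4)

print(plain, cost, shifted, ideal)   # True False False False
```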

KNN-PDE-Alike Conditional Probability Density Estimation Strategy.
Motivated by (4), we observe a new potential CIL solution: neglect the prior probabilities and make decisions by directly estimating the conditional probability density of each class. This solution avoids the tedious procedure of balancing prior probabilities and solves the CIL problem at its root. The problem seems simpler; however, it is still difficult to estimate the conditional probability accurately. GNB, at least, cannot measure it precisely, since in many real-world applications the data do not follow a Gaussian distribution. Figure 2 presents two groups of data with the same mean and variance but different distributions. Clearly, GNB provides an accurate estimate for the first group of data, while on the second group the estimate is extremely biased.
To provide accurate and robust probability density estimation, we present a novel strategy named the KNN-PDE-alike algorithm. As its name indicates, the KNN-PDE-alike algorithm is quite similar to the KNN-PDE algorithm [47], but it adapts better to various complex data distributions.
Given a group of data, the KNN-PDE-alike algorithm first searches for the Kth nearest neighbor of each instance x_i and records the Euclidean distance between them as d_i^K. Obviously, a smaller d_i^K indicates a higher probability density around x_i, and vice versa. Since this value runs counter to intuition (small values mean high density), a transformation rule is needed. In this study, we adopt a simple rule that takes the reciprocal of the distance to express the probability density; that is, for a sample x_i, its rough probability density is represented as o_i = 1/d_i^K. However, o_i is still expressed on the scale of a distance, which is inconsistent with the definition of a probability density. Therefore, the rough probability density o_i is further refined by the transformation

P(x_i) = o_i / O,   (8)

where P(x_i) denotes the refined probability density of x_i and the normalization factor O can be calculated by

O = ∑_{j=1}^{s} o_j.   (9)

The transformation can be seen as a normalization procedure for generating the probability density. In fact, P(x_i) is still not a real probability density; however, it reflects the real proportional relation among the probability densities of different examples. Therefore, we call this kind of probability density the relative probability density.
According to the procedure described above, we provide the pseudocode of the KNN-PDE-alike algorithm in Algorithm 1.

Algorithm 1: KNN-PDE-alike.
Input: a data set Ψ = {x_i | x_i ∈ R^n, 1 ≤ i ≤ s} and the parameter K.
Output: a 1 × s vector ρ recording the relative probability density of all instances.
(1) For each instance x_i:
(2)   For each instance x_j (j ≠ i), calculate the Euclidean distance between x_i and x_j, recording it as d_ij;
(3)   Sort all distances of x_i in ascending order, find the Kth one, and record it as d_i^K;
(4)   Transform d_i^K into o_i by the reciprocal rule;
(5) End for
(6) Compute the relative probability density P(x_i) of each instance by (8) and (9), and store it in ρ.
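For readers who prefer running code, here is a compact Python sketch of Algorithm 1 (our own NumPy rendering under the definitions above; the small epsilon guard against duplicate points is our addition).

```python
import numpy as np

def knn_pde_alike(X, K):
    """Relative probability density of every row of X (shape s x n), per (8)-(9)."""
    # Pairwise Euclidean distances, with self-distances excluded.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)
    # Distance from each instance to its K-th nearest neighbor.
    d_K = np.sort(dist, axis=1)[:, K - 1]
    # Reciprocal rule: a small distance means a high rough density o_i.
    o = 1.0 / np.maximum(d_K, 1e-12)   # epsilon guards against duplicate points
    return o / o.sum()                 # normalization of (8)-(9)
```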

Probability Density Machine (PDM) Algorithm.
It is clear that we can accurately estimate the relative probability density of each sample by virtue of the KNN-PDE-alike strategy. Moreover, the relative probability density can be regarded as a scaling transformation of the corresponding real conditional probability density. Therefore, according to (4), we can directly decide the class label of a sample by comparing its relative probability densities in the different classes. We call this method the Probability Density Machine (PDM).
However, we note that, in class-imbalanced data, different classes always contain different numbers of samples, which means that the relative probability densities of different classes calculated by (8) cannot be compared directly, as they are normalized on different scales. To make them comparable, an extra normalization step is conducted. For an instance x_i, its normalized relative probability density in the jth class can be calculated by

P′(x_i) = P(x_i) · s_j / s_min,   (10)

where P(x_i) and P′(x_i) denote the original and the normalized relative probability density of the instance x_i in the jth class, respectively, while s_j and s_min represent the number of instances in the jth class and in the class with the fewest instances, respectively.
In addition, we should pay attention to another important parameter that might influence the accuracy of relative probability density estimation: the neighborhood parameter K. It is obviously inappropriate to designate a common K for all classes; hence, we design an adaptive assignment rule in which K is automatically associated with the number of instances in a specific class. Furthermore, it is unsuitable to designate a value of K that is too large or too small: a too large K tends to destroy the difference between density regions, while a too small K is sensitive to noise. Hence, in this study, we empirically designate K = √s as the default setting, where s is the number of instances in the corresponding class. The influence of the parameter K on the quality of the PDM algorithm is further discussed in Section 5.
According to the procedure described above, we provide the pseudocode of the Probability Density Machine (PDM) algorithm in Algorithm 2.

Algorithm 2: Probability Density Machine (PDM).
Input: an imbalanced training set Ψ = {(x_i, y_i) | x_i ∈ R^n, 1 ≤ i ≤ s, y_i ∈ {Y_1, Y_2, ..., Y_m}} and a test instance x′.
Output: a predicted class label y′ for x′.
Training procedure:
(1) For each class i ∈ {1, ..., m}:
(2)   Extract the subset Φ_i of instances belonging to class i;
(3)   Record the number of instances in Φ_i as s_i, and calculate the corresponding parameter K_i based on s_i;
(4)   Call the KNN-PDE-alike algorithm on Φ_i to acquire and record its normalization factor O_i;
(5) End for
(6) Rank all s_i to get s_min.
Testing procedure:
(1) For each class i ∈ {1, ..., m}:
(2)   Calculate the Euclidean distance between x′ and each x_j in Φ_i, recording it as d_ij;
(3)   Sort all distances in ascending order, find the K_i-th one, and record it as d^{K_i};
(4)   Transform d^{K_i} into o′ by the reciprocal rule;
(5)   Call O_i to calculate the relative probability density P_i(x′) of x′ by (8);
(6)   Call s_i and s_min to adjust P_i(x′) by (10);
(7) End for
(8) Rank the m adjusted relative probability densities in descending order, and output the class label y′ ∈ {Y_1, Y_2, ..., Y_m} that ranks first as the prediction for x′.
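The sketch below renders Algorithm 2 in Python (a sketch following the description above, not the authors' released MATLAB code; the `alpha` multiplier for K_i and the epsilon guards are our additions).

```python
import numpy as np

def _kth_dist(A, B, K):
    """Distance from each row of A to its K-th nearest row of B."""
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))
    return np.sort(d, axis=1)[:, K - 1]

class PDM:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # K_i = round(alpha * sqrt(s_i)); alpha = 1 is the default rule

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.X_, self.s_, self.K_, self.Z_ = {}, {}, {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            s_c = len(Xc)
            K_c = max(1, int(round(self.alpha * np.sqrt(s_c))))
            # K_c + 1 skips the zero self-distance when a class is matched to itself.
            dK = _kth_dist(Xc, Xc, K_c + 1)
            self.X_[c], self.s_[c], self.K_[c] = Xc, s_c, K_c
            self.Z_[c] = (1.0 / np.maximum(dK, 1e-12)).sum()  # factor O_i of (9)
        self.s_min_ = min(self.s_.values())
        return self

    def predict(self, X_test):
        scores = np.zeros((len(X_test), len(self.classes_)))
        for j, c in enumerate(self.classes_):
            dK = _kth_dist(X_test, self.X_[c], self.K_[c])
            P = (1.0 / np.maximum(dK, 1e-12)) / self.Z_[c]    # relative density, (8)
            scores[:, j] = P * self.s_[c] / self.s_min_       # rescaling of (10)
        return self.classes_[np.argmax(scores, axis=1)]
```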
From the pseudocode of the PDM algorithm, we observe that it is very similar to lazy learning algorithms such as the K-nearest neighbors classifier, as it conducts many operations in the testing phase. Actually, there is a significant difference between the PDM and lazy learning algorithms: lazy learning requires no training, while the PDM conducts a training procedure to acquire the normalization factor O_i, the number of instances s_i, and the parameter K_i of each class. Therefore, we call the PDM a semilazy learning algorithm. Furthermore, to verify the effectiveness of the PDM algorithm, we compared it with the GNB algorithm on two imbalanced artificial data sets: A1 contains 1000 majority instances drawn from a Gaussian distribution with mean 0.5 and variance 0.1 and 100 minority instances drawn from a Gaussian distribution with mean 0.7 and variance 0.1, while A2 contains 5000 majority instances and 100 minority instances drawn from the same distributions as A1. That is to say, both data sets have the same distributions but different class imbalance ratios. Figure 3 presents the instance distributions of these two data sets and the decision regions produced by GNB and PDM, respectively.
In Figure 3, we observe that the GNB is very sensitive to the variation of the class imbalance ratio (IR): as the IR increases, the decision region of the minority class shrinks significantly. By contrast, the PDM is robust, as its decision boundary remains approximately invariant under the varying IR. This fact again verifies our deduction that the impact of an imbalanced data distribution is related only to the prior probability, not to the conditional probability.
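The following snippet mimics this experiment (assumed setup: 2-D isotropic Gaussians with the stated means, reading the "0.1" as the scale parameter; it also reuses the PDM sketch above). Probing points along the segment between the two class centers illustrates how GNB's minority region shrinks as the IR grows while PDM's stays nearly fixed.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

def make(n_maj, n_min):
    X = np.vstack([rng.normal(0.5, 0.1, size=(n_maj, 2)),
                   rng.normal(0.7, 0.1, size=(n_min, 2))])
    y = np.array([0] * n_maj + [1] * n_min)
    return X, y

# Probe points along the diagonal between the two class centers.
grid = np.column_stack([np.linspace(0.5, 0.7, 21)] * 2)
for n_maj in (1000, 5000):   # A1, then A2
    X, y = make(n_maj, 100)
    n_gnb = GaussianNB().fit(X, y).predict(grid).sum()
    n_pdm = PDM().fit(X, y).predict(grid).sum()
    # Expected: the GNB minority count drops as IR grows; PDM's barely moves.
    print(n_maj, n_gnb, n_pdm)
```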

Time-Complexity Analysis.
Finally, we focus on the time complexity of the proposed PDM algorithm. As a semilazy learning algorithm, the PDM's execution can be divided into two subprocedures.
Suppose the data set includes s instances divided into m classes, and each instance holds n attributes. In the training subprocedure, it mainly costs O(s) time to split the data into class subsets and to determine the parameter K of each class, O(m log m) time to find s_min, O(s²n) time to calculate the distances among all instances, O(s² log s) time to sort the distances, and O(s) time to conduct the reciprocal operation and to calculate the normalization factor of each class. In real-world applications, the number of training instances is generally much larger than the number of classes, so the number of training instances s dominates the time complexity of the training phase. On low-dimensional data, i.e., log s >> n, the time complexity is O(s² log s), while on high-dimensional, small-sample data, i.e., log s << n, the time complexity becomes O(s²n).
During the testing subprocedure, a single test instance consumes O(sn) time to calculate distances, O(s log s) time to sort them, O(m) time to conduct the reciprocal operation and calculate the relative probability densities, and O(m log m) time to provide the prediction. In comparison with the training subprocedure, the time complexity of the testing subprocedure is negligible.
Based on the analysis above, we must admit that the proposed PDM algorithm sacrifices much more running time than GNB. The PDM may be accurate, effective, and robust, but it is not efficient.

Data Set Description.
We collected 30 binary-class and 10 multiclass imbalanced data sets to verify the effectiveness and feasibility of the proposed PDM algorithm. The collection includes 13 data sets acquired from the UCI machine learning repository [48] and 27 data sets collected from the Keel data repository [49]. These data sets differ in the number of instances, number of attributes, number of classes, and class imbalance ratio (IR). A detailed description of these data sets is provided in Table 1.

Experimental Settings.
All experiments were run on a 2.60 GHz Intel(R) Core(TM) i7-6700HQ 8-core CPU with 16 GB RAM, in the MATLAB 2013a environment.
To show the effectiveness and superiority of the proposed PDM algorithm, we compared it with the following algorithms:
(1) GNB: trains a GNB classifier [45,46] directly on the data without considering the skewed data distribution.
(2) CS-GNB: first trains a GNB classifier [45,46] on the original data and then empirically assigns decision costs to the posterior probability of each class (see (5) and (6)).
(3) RUS-GNB: first calls the random undersampling (RUS) algorithm [20] to generate a balanced training set from the original data and then trains a GNB classifier [45,46] on the balanced data.
(4) ROS-GNB: first calls the random oversampling (ROS) algorithm [20] to generate a balanced training set from the original data and then trains a GNB classifier [45,46] on it.
(5) SMOTE-GNB: first calls the Synthetic Minority Oversampling Technique (SMOTE) [21], with its default parameter K = 5, to generate a balanced training set and then trains a GNB classifier [45,46] on it.
As we know, when evaluating the performance of a CIL classifier, overall classification accuracy is no longer an appropriate metric; hence, we adopted the F-measure and the G-mean as performance evaluation metrics, where the F-measure assesses the tradeoff between precision and recall, while the G-mean assesses the tradeoff among the accuracies of all classes.
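For reference, the two metrics could be computed as below (standard definitions; treating the minority class as positive for the binary F-measure and the zero-division handling are our assumptions).

```python
import numpy as np
from sklearn.metrics import f1_score, recall_score

def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls."""
    recalls = recall_score(y_true, y_pred, average=None, zero_division=0)
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

def f_measure(y_true, y_pred):
    """Binary F-measure with the minority class treated as positive."""
    labels, counts = np.unique(y_true, return_counts=True)
    return f1_score(y_true, y_pred, pos_label=labels[np.argmin(counts)])
```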
To compare the performance of the various algorithms impartially, we ran external 5-fold cross-validation 10 times with random partitions and report the average performance as the final result. Tables 2 and 3 present the F-measure and G-mean results of the six algorithms on the 40 imbalanced data sets.

Results and Discussion.
From the results in Tables 2 and 3, it is not difficult to draw the following conclusions.
(1) We observed an interesting result: although GNB fails to consider the class imbalance in the data distribution, it performs no worse than the CS-GNB, RUS-GNB, ROS-GNB, and SMOTE-GNB algorithms on most data sets. Specifically, GNB acquires the best F-measure and G-mean results on 7 and 5 data sets, respectively. We believe this counterintuitive phenomenon has two possible causes. The first, and most important, is that the instance distributions of many of the data sets may be close to Gaussian. In addition, the data sets on which GNB performs better may have a large margin between classes, so that even a biased model can classify most instances accurately. (2) The class imbalance ratio seems to be an important factor influencing the performance of GNB, which is especially significant in terms of the G-mean metric.
In particular, on the abalone19 and poker8v6 data sets, which have very high class imbalance ratios, GNB acquires G-mean values of only 0.2402 and 0.0000. These results again support our theoretical deduction. (3) Among the GNB variants, SMOTE-GNB clearly outperforms its three competitors. The reasons are not difficult to analyze: RUS tends to lose information that describes the real data distribution; ROS decreases the accuracy of probability density estimation, as it only duplicates minority instances at their original positions; and although the cost-sensitive strategy avoids the impact of the biased prior probabilities, it still cannot provide a robust estimate of the real conditional probability density. As a synthetic oversampling technique, SMOTE fills the space among the original minority instances, improving the accuracy of conditional probability density estimation. This explains why SMOTE-GNB outperforms the other variants. (4) Clearly, our proposed PDM model is superior to the other five algorithms. Specifically, PDM acquires the best results on 29 and 26 data sets in terms of F-measure and G-mean, respectively. PDM improves modeling quality from two aspects: it completely balances the prior probabilities, and it provides an approximately accurate conditional probability density estimate for any type of data distribution. These results further indicate the effectiveness and feasibility of the proposed PDM algorithm.

Significance Analysis in Statistics.
To present a thorough comparison of the various algorithms, we also provide statistical results. We employ the Friedman ranking test and the Holm post-hoc test [50,51] to differentiate the performance of the comparative methods on the 40 data sets. The statistical results obtained at the confidence level α = 0.05 are presented in Tables 4 and 5. They show that the proposed PDM algorithm acquires the lowest average ranking in terms of both the F-measure and the G-mean metrics, indicating that it is the best of the comparative algorithms. For both metrics, the Holm post-hoc test rejects all comparative algorithms except SMOTE-GNB, as their p values are smaller than the corresponding Holm values, which verifies that the PDM significantly outperforms those algorithms. Of course, we cannot claim a significant difference between the PDM and SMOTE-GNB, although the former has a lower average ranking. All in all, the proposed PDM algorithm presents superior performance and a lower average ranking than all competitors, and thus we can safely recommend it as an effective solution to the CIL problem.
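For concreteness, a sketch of how such an analysis could be run (Friedman omnibus test plus Holm-corrected z-tests against a control method, following the standard Demšar-style procedure; the matrix layout and helper name are our own).

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata, norm
from statsmodels.stats.multitest import multipletests

def friedman_holm(results, names, control):
    """results: (n_datasets x n_algorithms) score matrix, higher is better."""
    stat, p = friedmanchisquare(*results.T)          # omnibus Friedman test
    n, k = results.shape
    avg_rank = np.apply_along_axis(rankdata, 1, -results).mean(axis=0)
    se = np.sqrt(k * (k + 1) / (6.0 * n))            # standard error of rank differences
    i = names.index(control)
    z = (avg_rank - avg_rank[i]) / se
    raw_p = 2 * norm.sf(np.abs(z))
    others = [j for j in range(k) if j != i]
    reject, p_holm, _, _ = multipletests(raw_p[others], alpha=0.05, method="holm")
    return stat, p, avg_rank, dict(zip([names[j] for j in others],
                                       zip(p_holm, reject)))
```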

Discussion about the Impact of Parameter K.
Next, we discuss the sensitivity of the parameter K, the sole parameter of the PDM algorithm. In Section 4, we designed it as a function of the number of instances s. To determine whether the modeling quality of the PDM is sensitive to K, we vary the value of K over a wide range and observe the resulting performance. The results on three different data sets are presented in Figure 4.
In Figure 4, we observe that the classification performance of the PDM is quite sensitive to the parameter K: as K varies, the classification performance changes drastically. Therefore, in practical applications, this parameter should be selected carefully.
From the results in Figure 4, we also find that although the best K value differs across data sets and performance metrics, all curves show a common trend: as K increases, the performance first improves to a peak and then declines sharply. This observation accords with our analysis in Section 4; that is, either a too small or a too large K value works against accurate estimation of the conditional probability density: a too small K is sensitive to noise, while a too large K tends to destroy the difference between density regions. According to the results presented in Figure 4, we suggest designating K as a value between √s/2 and 2√s. In real-world applications, the best K value can also be determined by internal cross-validation.
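In code, such an internal cross-validation could look like the sketch below (our own helper, reusing the PDM class and g_mean function sketched earlier; the alpha grid spans the suggested [√s/2, 2√s] neighborhood via K_i = round(alpha · √s_i)).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def pick_alpha(X, y, alphas=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Choose the K multiplier by 5-fold internal CV on the G-mean."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    best_alpha, best_score = None, -1.0
    for a in alphas:
        scores = [g_mean(y[te], PDM(alpha=a).fit(X[tr], y[tr]).predict(X[te]))
                  for tr, te in skf.split(X, y)]
        if np.mean(scores) > best_score:
            best_alpha, best_score = a, float(np.mean(scores))
    return best_alpha, best_score
```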

Concluding Remarks
In this study, we analyzed from a theoretical perspective why a class-imbalanced distribution hurts the performance of a predictive model, in the context of the Gaussian Naive Bayes classifier. We deduced that the harm of an imbalanced data distribution is associated only with the prior probability, not with the conditional probability density. Accordingly, we analyzed the rationality of several popular class imbalance learning techniques and indicated the necessity of accurately estimating the conditional probability density of the data distribution. Then, a robust probability density estimation algorithm named KNN-PDE-alike was proposed, and based on it, a novel class imbalance learning solution named the PDM was presented. Experimental results on 40 binary-class and multiclass imbalanced data sets show the effectiveness and superiority of the proposed PDM algorithm. The proposed PDM algorithm has the following merits: (1) it has a good theoretical basis; (2) it is insensitive to the data distribution and adapts to various types of data distributions; (3) it can directly classify not only binary-class but also multiclass imbalanced data. In future work, we will investigate more robust and faster probability density estimation strategies, and the effectiveness and superiority of the PDM will be further verified by applying it to more real-world class imbalance applications.
Data Availability
The data sets used in this paper were collected from the openly available UCI machine learning repository and Keel data repository (http://archive.ics.uci.edu/ml/datasets.php and https://sci2s.ugr.es/keel/datasets.php). The MATLAB code of the PDM algorithm can be downloaded from https://github.com/yuhualong1982/Probability-Density-Machine.

Conflicts of Interest
The authors declare that there are no known conflicts of interest.