A New Framework Consisted of Data Preprocessing and Classifier Modelling for Software Defect Prediction



Introduction
In software engineering, software defect prediction (SDP) refers to identifying faulty modules as precisely as possible so that software engineers can allocate limited testing resources to the most fault-prone modules before the testing and maintenance phases. The most important step in predicting defects is to establish a robust defect prediction model (also known as a defect predictor), which has motivated many researchers to design predictors covering different aspects of software quality [1][2][3][4][5][6][7]. These models can, in general, be divided into two groups: Within-Project Defect Prediction (WPDP) and Cross-Project Defect Prediction (CPDP). WPDP refers to predictors that are trained on data from historical releases of the same project and estimate the most fault-prone modules in new releases [8]. Zimmermann et al. [9] argued that defect predictors perform well within projects as long as sufficient data are available to train the models. However, it is not always possible to collect such historical data for new projects. Thus, WPDP may be of limited use in cases where few historical releases are available.
An alternative for predicting defects in new projects that lack historical data is to use data from different projects as training samples, i.e., CPDP. More formally, CPDP aims to detect the most fault-prone modules in a project using predictors trained on historical data from other projects [2]. A key component of such predictors is how the machine learning algorithm for SDP is built. Thus, researchers have considered two main factors to boost the effectiveness of CPDP: selecting the training dataset appropriately and improving the learning algorithm.
For the first issue, many researchers have found that the performance of predictors is seriously affected by the quality of the training data [10][11][12][13]. Cross-project data suffer from superfluous instances and from redundant or unnecessary features (also known as software metrics). It has been shown that instance filtering (i.e., selecting similar data) and feature selection can handle these two problems, respectively [14,15]. To the best of our knowledge, however, few studies discuss how to perform instance filtering and feature selection simultaneously to improve training data quality in SDP.
To improve the learning algorithm, it should first be noted that the main task of CPDP is to classify the modules of new software projects into two classes: non-fault-prone (NFP) or fault-prone (FP). Hence, various classifiers (i.e., learning algorithms) have been employed for CPDP, such as Logistic Regression (LR), Naïve Bayes (NB), Random Tree (RT), and Support Vector Machine (SVM) [5][6][7]. Menzies et al. [16] argued that NB may be more suitable than LR, RT, and SVM (more details are given in Section 4.3.3) [17]. The implementation of NB is based on the assumption that the datasets of interest are normally distributed. However, after conducting a robust normality hypothesis test (the Kolmogorov-Smirnov test [18][19][20]) on 32 different software releases (the referenced datasets, ranging from small to large, are described in Table 5 and are available in the PROMISE data repository at http://openscience.us/repo), we found that the 21 metrics most commonly used in software (see Table 6, comprising 20 feature attributes and 1 labeled attribute) tend to follow nonnormal distributions. Therefore, NB is not an appropriate classifier for CPDP, since it is built on a hypothesis that is inconsistent with the results of the Kolmogorov-Smirnov test (details are given in Table 7 and [21]).
To overcome the two main shortcomings of CPDP mentioned above, we develop a new CPDP approach (see Figure 1). First, as proposed by Turhan et al. [22], the Euclidean distance is effective in measuring the similarity between the given releases and historical datasets across projects, and it helps select proper training data automatically. The classic k-means algorithm [23] is well known for its efficiency in clustering datasets based on the Euclidean distance norm. Therefore, we introduce the k-means algorithm, in combination with correlation analysis and random sampling, for instance filtering, feature selection, and instance reduction in data preprocessing, respectively.
Second, we establish a new classifier based on the maximum correntropy criterion (MCC) [24], which is well known for its efficiency in handling non-Gaussian (nonnormal) noise, along with an $\ell_2$-norm penalty. MCC has been verified to yield robust analysis in face recognition [25], image processing [26], and regression analysis [24], to name a few. However, the optimization problem of correntropy combined with an $\ell_2$-norm penalty is quite difficult to solve directly. Thus, the Half-Quadratic (HQ) optimization framework is used to optimize the objective function.
In short, this paper aims to integrate the advantages of k-means and MCC to boost the potential of CPDP. For simplicity, the newly proposed framework for identifying fault-prone modules is denoted as KMP, for k-means and MCC-based predictor. To investigate the efficiency of KMP, we conduct an experiment on the aforementioned 9 software projects and seek answers to the following research questions.

RQ1: Is the MCC-based classifier better to obtain prediction results compared with other commonly used learning algorithms?
This question evaluates how well the MCC-based classifier predicts whether a software module of interest is faulty or nonfaulty.

RQ2: How does KMP perform in comparison to the state-of-the-art method?
He et al. [8] proposed a method of finding training datasets across projects by testing a very large number of combinations of historical releases from other software projects. This approach, with a simplified software metric set, has also been validated by He et al. [27] and even outperforms the initial one equipped with all metrics in most cases. Therefore, we include the newer method of He et al. as a yardstick to investigate whether KMP can obtain better prediction results.
The rest of this paper is structured as follows. Section 2 gives a brief overview of CPDP, the k-means algorithm, and MCC. Section 3 presents the newly proposed methodology for predicting defects in software systems. Section 4 designs two research questions and the associated experimental settings. The prediction results are shown in Section 5 and discussed in Section 6. Finally, we close the paper with a summary in Section 7.

Preliminaries
For the reader's convenience, this section reviews the related work on CPDP and some basic concepts of the k-means algorithm and the maximum correntropy criterion.

Related Work about CPDP.
Briand et al. [28] may have been the first to publish an exploratory study of CPDP. They used a simple but rigorous algorithm to analyze an open-source project (called Jwriter) based on another software project (called Xpose) and obtained a relatively accurate fault-prone class ranking. The significance of CPDP then motivated many researchers to develop novel models for software fault prediction. Turhan et al. [22] proposed a nearest-neighbor filtering approach to selecting robust cross-project data for training prediction models. Rahman et al. [29] carried out a cost-sensitive analysis of CPDP on 9 large Apache Software Foundation projects with 38 releases. They revealed that cost-sensitive CPDP performance was not worse than WPDP performance. In recent years, ensembles of different fault prediction techniques, such as Poisson regression, multilayer perceptron, and genetic programming, have been shown to be very effective in SDP, because an ensemble takes advantage of each participating algorithm and can provide more acceptable prediction results than the individual techniques [30,31]. From previous empirical studies on CPDP, we find that relatively little attention has been paid to predictors that address training data selection and learning algorithm optimization simultaneously.

K-Means Algorithm.
The k-means algorithm is one of the most widely used clustering techniques. It minimizes the following objective function with respect to the cluster centers $\mathbf{m}_j$:

$$J = \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} d(\mathbf{x}_i, \mathbf{m}_j), \tag{1}$$

where $k$ is the number of clusters and $d(\cdot)$ is, in general, the Euclidean distance norm. $\mathbf{x}_i$ denotes each data point of interest (e.g., an instance of a software project), $\mathbf{x}_i \in \mathbf{X}$. Let $\mathbf{C} = \{C_1, C_2, \ldots, C_k\}$ be a partition of $\mathbf{X}$, $\bigcup_{j=1}^{k} C_j = \mathbf{X}$. The k-means algorithm is then executed in the following steps [23]:

(1) Initialize the center $\mathbf{m}_j$ of cluster $C_j$, $j = 1, 2, \ldots, k$.
(2) Assign $\mathbf{x}_i$ to the cluster $C_j$ if
$$d(\mathbf{x}_i, \mathbf{m}_j) \le d(\mathbf{x}_i, \mathbf{m}_l), \quad l = 1, 2, \ldots, k,\ l \ne j. \tag{2}$$
(3) Update the center $\mathbf{m}_j$ by
$$\mathbf{m}_j = \frac{1}{|C_j|} \sum_{\mathbf{x}_i \in C_j} \mathbf{x}_i. \tag{3}$$
(4) Repeat steps (2) and (3) until the value of $J$ is no longer decreasing.

The computational cost of the k-means algorithm is low because of its quick convergence and the low cost of the distance computation. Additionally, the k-means algorithm measures the similarity between each pair of data points mainly by the Euclidean distance, whose efficiency in software fault prediction has been demonstrated by Turhan et al. [22]. Therefore, it is reasonable to employ the k-means algorithm to quickly select or filter instances from other software projects as the training dataset for identifying fault-prone modules of the new project.
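The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `euclidean` and `kmeans`, the random initialization from the data points, the stopping test on unchanged assignments, and the toy points are all hypothetical choices made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step (1): initialize the centers with k distinct data points (an assumption).
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step (2): assign each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centers[j])) for p in points
        ]
        # Step (4): stop once assignments no longer change (J cannot decrease further).
        if new_labels == labels:
            break
        labels = new_labels
        # Step (3): move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, labels = kmeans(points, k=2)
```

In the setting of this paper, `points` would hold the metric vectors of both the new release and the historical releases, so that each new module ends up in a cluster together with its most similar historical instances.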

Maximum Correntropy Criterion.
Correntropy is well known in information theoretic learning for its ability to handle non-Gaussian noise [25,26]. Liu et al. [24] derived it from a correlation function in random processes and stated that correntropy is closely related to the information potential (IP) of Renyi's quadratic entropy and to M-estimators. Based on IP, correntropy is defined as a robust local similarity measure between two arbitrary random variables $A$ and $B$:

$$V_\sigma(A, B) = E\left[\kappa_\sigma(A - B)\right], \tag{4}$$

where $E(\cdot)$ denotes the expectation operator and $\kappa_\sigma(\cdot)$ is a kernel function that satisfies Mercer's theory [32]. From (4), it can be seen that correntropy differs from classic kernel functions in that it works with each pair of samples independently. Additionally, it can make use of the kernel technique that nonlinearly maps the input data to a higher dimensional space [25,26]. Hence, correntropy has several important properties resting on a robust theoretical foundation: it is symmetric, bounded, and positive [24].
In practice, we often obtain only a finite number of paired samples $\{(a_i, b_i)\}_{i=1}^{n}$ whose joint probability is unknown. This leads to the sample estimator of correntropy:

$$\hat{V}_{n,\sigma}(A, B) = \frac{1}{n} \sum_{i=1}^{n} \kappa_\sigma(a_i - b_i). \tag{5}$$

The Gaussian kernel $g(x) \triangleq \exp(-x^2 / 2\sigma^2)$ is generally selected as the kernel function $\kappa_\sigma(\cdot)$. Thus, we can rewrite (5) as

$$\hat{V}_{n,\sigma}(A, B) = \frac{1}{n} \sum_{i=1}^{n} g(a_i - b_i). \tag{6}$$

We then define the error $e_i$ in adaptive systems as $e_i = a_i - b_i$. Maximizing (6) with respect to the errors,

$$\max \frac{1}{n} \sum_{i=1}^{n} g(e_i), \tag{7}$$

is called the maximum correntropy criterion (MCC) [24], where the parameter $\sigma$ will be specified in later sections.
Compared with the mean square error, which is a global metric, MCC is local, which means that the optimum of the correntropy value is found faster over the search domain, i.e., along the line $A = B$. Moreover, MCC can cope with both Gaussian and non-Gaussian noise [25,26], which indicates that MCC may be an efficient tool for SDP, since most variables in software engineering (such as software metrics) follow nonnormal distributions (see Table 7).
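To make the contrast between the local correntropy measure and the global mean square error concrete, the following sketch evaluates both on a small hypothetical sample with and without one gross error. The function names, the choice sigma = 1, and the toy numbers are assumptions made for illustration only.

```python
import math


def gaussian_kernel(e, sigma):
    """Gaussian kernel g(e) = exp(-e^2 / (2 sigma^2))."""
    return math.exp(-(e ** 2) / (2.0 * sigma ** 2))


def sample_correntropy(a, b, sigma=1.0):
    """Sample estimator: mean Gaussian kernel of the pairwise errors."""
    return sum(gaussian_kernel(x - y, sigma) for x, y in zip(a, b)) / len(a)


def mean_squared_error(a, b):
    """Global error measure used for comparison."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


targets = [1.0, 2.0, 3.0, 4.0, 5.0]
clean = [1.1, 1.9, 3.1, 3.9, 5.1]           # small errors only
with_outlier = [1.1, 1.9, 3.1, 3.9, 50.0]   # one gross error

c_clean = sample_correntropy(targets, clean)
c_outlier = sample_correntropy(targets, with_outlier)
mse_clean = mean_squared_error(targets, clean)
mse_outlier = mean_squared_error(targets, with_outlier)
```

The single outlier contributes essentially zero to the correntropy sum, so the estimate drops by at most the outlier's share of the sample (here one fifth), whereas the mean square error grows by several orders of magnitude. This bounded influence of large errors is the property that MCC-based estimators exploit.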

A k-Means and Maximum Correntropy Criterion Based Predictor (KMP)
In this section, we propose a new CPDP approach called KMP, which consists of four stages: instance filtering by k-means, feature selection using correlation analysis, instance reduction based on random sampling, and building an MCC-based classifier that categorizes the modules of new software projects into the FP class or the NFP class (see Figure 1). The details of each stage of KMP are explained in the following subsections.

Instance Filtering.
According to the principle of analogy-based learning, the k-means algorithm is used to select historical data from other projects that are similar to the new software project. More specifically, let each module (or instance) in the dataset of a new software project (DNSP) and in the historical dataset of other projects (HDP) be $\mathbf{x}_i = (f_{i1}, f_{i2}, \ldots, f_{id})$, $i = 1, 2, \ldots, n$, without the class label attribute, where $f$ represents the feature attributes and $d = 20$ denotes the number of features. We perform the k-means algorithm following the step-by-step procedure in Section 2.2. Let $\mathbf{C} = \{C_1, C_2, \ldots, C_k\}$ be the final partition of $\mathbf{X}$ ($\mathbf{x}_i \in \mathbf{X}$). Each module $\mathbf{x}_i$ in DNSP is thus assigned to its own cluster $C_j$. In this stage, we use the k-means algorithm to assign each module to its nearest cluster as measured by the Euclidean distance, which Turhan et al. [22] verified to be an effective distance norm for instance filtering. Therefore, the cluster $C_j$ is selected as the potential training dataset for detecting faults in $\mathbf{x}_i$ in the next stages.
We take the release Ivy-1.1 (see Table 5) as an example to show how to implement instance filtering in detail. All the modules $\mathbf{x}_i(f_1, \ldots, f_d)$, $i = 1, \ldots, n_1$, of the release Ivy-1.1, with all software metrics and the unknown Bug attribute, together with the historical cross-project instances $\mathbf{x}'_i(f_1, \ldots, f_d, l)$, $i = 1, \ldots, n_2$, with the known Bug attribute from the other 31 releases, are used to run the k-means algorithm. We set the number of clusters $k$ to $\alpha\sqrt{n}$, as suggested by Jian and Cheng [33] ($\alpha \in [0.33, 0.50]$ is found to be experimentally fine). With $\alpha = 0.50$, $k = 56$ in this paper, since the total number of modules is $n = 12505$. Table 1 presents the clustering results.
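The reported cluster count can be checked with a short calculation. The sketch below assumes that the rule has the form k = scale × √n, which is inferred from the reported numbers (a scale of 0.50 and 12505 modules giving 56 clusters); the function name `number_of_clusters` and the parameter name `scale` are hypothetical.

```python
import math


def number_of_clusters(n_modules, scale=0.5):
    """Cluster count under the assumed rule k = scale * sqrt(n), rounded."""
    return round(scale * math.sqrt(n_modules))


k_upper = number_of_clusters(12505, scale=0.50)  # upper end of the reported range
k_lower = number_of_clusters(12505, scale=0.33)  # lower end of the reported range
```

Under this assumption, `k_upper` reproduces the value 56 used in the paper, and the lower end of the recommended range would correspond to 37 clusters.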

Feature Selection and Instance Reduction.
As suggested by He et al. [27], a simplified feature set can improve the performance of SDP and keep fault prediction from becoming a time-consuming process. Therefore, in the second stage of KMP, we remove irrelevant or redundant features from the training data derived from the first stage. Since we aim to develop a robust linear classifier based on MCC in Section 3.3, correlation analysis is employed here for feature selection. We measure the correlation between the 20 features and the Bug attribute (denoting the number of bugs detected in the module) for each cluster obtained from the k-means algorithm, in order to find relevant features and reduce redundant ones. Specifically, a module $\mathbf{x}_i$ in DNSP is assigned to the cluster $C_j$ with the historical dataset $\mathbf{X}_j = \{\mathbf{x}_i \mid (f_{i1}, f_{i2}, \ldots, f_{id}, l_i)\}_{i=1,2,\ldots,n}$ across projects, where $l$ denotes the Bug and $n$ the number of historical samples. We then perform correlation analysis between each feature and the Bug within $\mathbf{X}_j$. Only features $(f_1, f_2, \ldots, f_m)$ that are statistically significant at the 95% confidence level are accepted as relevant features. Moreover, if the number of accepted features, $m$, is quite large, say more than 6, the selected features are simplified further by ranking them according to their correlation coefficients. Lastly, we select the top-K of the ranking list as the most relevant feature set for the cluster $C_j$, where $K = \sqrt{M} \cdot \lg M$, with $M$ the number of attributes, as recommended by Liu et al. [34] and Song et al. [35]. It should be noted that the capital $K$ is different from the lowercase $k$ in this paper. Considering the close relationship between feature selection and instance reduction, we describe the approach to the latter in the same subsection.
In general, the number of NFP instances in each software project is much greater than that of FP instances. Therefore, we should appropriately reduce the NFP instances to control the quality of the training data with the relevant feature set. Although removing instances bears the risk of losing valuable information, random sampling is still one of the most widely applied techniques because of its efficiency in instance reduction [34,36]. We randomly select the NFP and FP instances in each cluster obtained from the first two stages, respectively, assuming that all the modules are generated from the same uniform distribution. The NFP/FP ratio of 65%/35% is used as the terminal condition of random sampling, as suggested by Khoshgoftaar et al. [37].
Continuing the example of Section 3.1, we show how to implement feature selection and instance reduction in detail. From Table 1, it can be seen that the first module of Ivy-1.1 has been assigned to the 29th cluster. Thus, we perform correlation analysis between each software feature attribute and the Bug using the historical datasets $\mathbf{x}'_i(f_1, f_2, \ldots, f_d, l)$ in this cluster. The correlation coefficients are presented in Table 2. Unfortunately, none of the metrics is significant at the 95% confidence level. We therefore rank the features directly by the absolute value of their correlation, and the top-K of the ranking list are taken as the selected feature set for the 29th cluster. Here, we adopt the recommendation from [35] that $K = \sqrt{21} \times \lg(21) \approx 6$ for 20 software metrics and 1 Bug attribute. Thus, the selected feature subset is {WMC, AMC, AVG_CC, CBO, LCOM3, MOA}. Lastly, we randomly select the NFP and FP instances from HDP in the 29th cluster, with the NFP/FP ratio of 65%/35% as the terminal condition.
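A minimal sketch of the instance reduction step is given below. It makes a simplifying assumption that is not stated in the paper: all FP instances are kept and only NFP instances are randomly sampled until the NFP share of the reduced set reaches 65%. The function name `reduce_nfp`, the data layout (a list of `(features, bug_count)` pairs), and the toy cluster are hypothetical.

```python
import random


def reduce_nfp(instances, nfp_share=0.65, seed=0):
    """Randomly undersample NFP instances to a target NFP share.

    instances: list of (features, bug_count) pairs; bug_count == 0 means NFP.
    Assumption: every FP instance is kept and only NFP instances are sampled.
    """
    rng = random.Random(seed)
    fp = [inst for inst in instances if inst[1] > 0]
    nfp = [inst for inst in instances if inst[1] == 0]
    # Solve n_nfp / (n_nfp + n_fp) = nfp_share for n_nfp.
    target_nfp = round(len(fp) * nfp_share / (1.0 - nfp_share))
    if target_nfp >= len(nfp):
        return fp + nfp  # NFP share is already at or below the target
    return fp + rng.sample(nfp, target_nfp)


# Toy cluster: 35 faulty and 200 non-faulty modules.
cluster = [([i], 1) for i in range(35)] + [([i], 0) for i in range(200)]
reduced = reduce_nfp(cluster)
n_fp = sum(1 for _, bug in reduced if bug > 0)
n_nfp = len(reduced) - n_fp
```

With 35 FP instances, the target share of 65% corresponds to 65 NFP instances, so the reduced toy cluster has 100 instances in total.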
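The ranking fallback used in this example (no metric significant, so features are ranked by absolute correlation and the top-K are kept) can be sketched as follows. The significance test itself is omitted, and the function names, the toy metric values, and the toy bug counts are hypothetical; only the rule K = √21 × lg(21) ≈ 6 is taken from the text, with lg read as the base-10 logarithm because that reading reproduces the reported value.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def top_k_size(n_attributes):
    """K = sqrt(M) * lg(M), rounded, with lg taken as log base 10."""
    return round(math.sqrt(n_attributes) * math.log10(n_attributes))


def rank_features(columns, bug, k):
    """Return the k feature names with the largest |correlation| to bug."""
    ranked = sorted(
        columns, key=lambda name: abs(pearson(columns[name], bug)), reverse=True
    )
    return ranked[:k]


# Toy cluster with three metrics and five modules (values are made up).
columns = {
    "WMC": [1, 2, 3, 4, 5],
    "CBO": [2, 1, 3, 5, 4],
    "NOC": [1, 1, 1, 1, 1],
}
bug = [0, 0, 1, 1, 2]

K = top_k_size(21)                         # 20 metrics + 1 Bug attribute
selected = rank_features(columns, bug, 2)  # keep the two strongest metrics
```

A full implementation would first filter the features by a significance test on the correlation coefficient and fall back to this ranking only when too many (or, as in the example, none) pass.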

A Robust Classifier Based on MCC.
For software fault prediction, let $\mathbf{L} = (l_1, \ldots, l_n)^T$ be a vector denoting the Bug attribute of each instance from HDP assigned to the cluster $C_j$, and let the vector $\mathbf{E} = \left(\sum_{q=1}^{m} f_{1q}\beta_q, \ldots, \sum_{q=1}^{m} f_{nq}\beta_q\right)^T$ be a linear estimator of the software defect set $\mathbf{L}$. We seek a proper weight vector $\beta = (\beta_1, \ldots, \beta_m)^T$ so that $\mathbf{E}$ becomes as correlated to $\mathbf{L}$ as possible under MCC. Thus, we obtain the following MCC-based model:

$$\max_{\beta} J(\beta) = \sum_{i=1}^{n} g\left(l_i - \sum_{q=1}^{m} f_{iq}\beta_q\right) - \lambda \|\beta\|_2^2, \tag{8}$$

where $g(\cdot)$ denotes a Gaussian kernel function and $\|\cdot\|_2$ is the $\ell_2$-norm. It is difficult to optimize (8) directly because the objective function of KMP is nonlinear. To solve this optimization problem, the half-quadratic (HQ) technique [38] is used in this subsection. The rationale for applying the HQ method is its effectiveness in solving optimization problems in information theoretic learning [25,26]. According to the theory of convex conjugate functions, (8) can be solved by means of the following proposition associated with the HQ technique.
Proposition 1 (see [25,26]). There must be a convex conjugate function $\varphi$ of $g(x)$ so that

$$g(x) = \max_{p} \left( p \frac{\|x\|^2}{\sigma^2} - \varphi(p) \right), \tag{9}$$

and, for a fixed $x$, the maximum is reached at $p = -g(x)$.
Proof. See He et al. [38].
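Substituting (9) into (8), we obtain a new objective function in the multiplicative HQ form:

$$\hat{J}(\beta, \mathbf{p}) = \sum_{i=1}^{n} \left( \frac{p_i}{\sigma^2} \left( l_i - \sum_{q=1}^{m} f_{iq}\beta_q \right)^2 - \varphi(p_i) \right) - \lambda \|\beta\|_2^2, \tag{10}$$

where $\mathbf{p} = (p_1, \ldots, p_n)^T$ denotes the auxiliary variables introduced by HQ optimization. Maximizing $J(\beta)$ is thus converted to maximizing the new objective function (10), i.e., $\hat{J}(\beta, \mathbf{p})$. According to Proposition 1, $p_i$ can be computed iteratively ($t$ denotes the iteration number):

$$p_i^{t+1} = -g\left( l_i - \sum_{q=1}^{m} f_{iq}\beta_q^{t} \right). \tag{11}$$

When $\mathbf{p}^{t+1}$ is given, the analytic solution of $\beta$ in (10) is

$$\beta^{t+1} = \left( \mathbf{F}^T \mathbf{P}^{t+1} \mathbf{F} - \lambda\sigma^2 \mathbf{I} \right)^{-1} \mathbf{F}^T \mathbf{P}^{t+1} \mathbf{L}, \tag{12}$$

where $\mathbf{F} = \{f_{iq}\}_{i=1:n,\,q=1:m}$ is the (most) relevant feature set of cluster $C_j$ and $\mathbf{P}^{t+1}$ is a diagonal matrix whose element $P_{ii} = p_i^{t+1}$. If there are large errors $l_i - \sum_{q=1}^{m} f_{iq}\beta_q$ in (11), the maximization of $\hat{J}(\beta, \mathbf{p})$ controls these errors by giving them proper weights in (12), so that they contribute little to (10). As a result, the optimal solution of $\hat{J}(\beta, \mathbf{p})$ is robust to large errors. Next, we state the convergence of $\hat{J}(\beta, \mathbf{p})$.

Proposition 2. The sequence $\{\hat{J}(\beta^{t+1}, \mathbf{p}^{t+1}),\ t = 1, 2, \ldots\}$ derived from (11)-(12) converges.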
We now illustrate how to classify an unknown module from DNSP. For each cluster $C_j$, let $\mathbf{F} = (f_1, \ldots, f_m)^T$ be the (most) relevant feature set and $\beta = (\beta_1, \ldots, \beta_m)^T$ be its associated weight vector derived iteratively from (12). We set $\lambda = 1$ in (10) throughout the paper, and the kernel size $\sigma$ is determined using Silverman's rule from $\sigma_E$, the standard deviation of the errors $l_i - \sum_{q=1}^{m} f_{iq}\beta_q$, and $R$, the error interquartile range. The module $\mathbf{x}_i = (f_{i1}, \ldots, f_{im})^T$ of interest is categorized into the FP class if

$$\sum_{q=1}^{m} f_{iq}\beta_q > \theta, \tag{16}$$

where $\theta$ is a constant ($\theta = 0.5$ is found to be experimentally fine); otherwise, we classify $\mathbf{x}_i$ into the NFP class. Continuing the example of the previous sections, we show how to perform prediction using the MCC-based classifier. The randomly selected training dataset with the feature subset {WMC, AMC, AVG_CC, CBO, LCOM3, MOA, Bug} (see Section 3.2) is used to train the MCC-based classifier based on (10)-(12). Iterating, we obtain the weight vector $\beta = (0.0298, 0.0394, 0.0297, -0.0409, 0.0015, 0.0015)^T$.
Therefore, we categorize the first module into the NFP class, which is consistent with the fact.

... Eqs. (10)-(15) in an iterative way;
predict whether the module $\mathbf{x}_i$ is FP or NFP based on Eq. (16);
10: end for
Algorithm 1: KMP algorithm.
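The iterative scheme of this subsection, in which samples are reweighted by the Gaussian kernel of their residuals and a regularized linear system is then solved, can be sketched in pure Python as follows. This is a hedged illustration rather than the authors' code: the weights are written as positive numbers (the negative of the HQ auxiliary variables, which Proposition 1 places at minus the kernel value), the kernel size is fixed at sigma = 1 instead of being set by Silverman's rule, `lam` stands for the effective ridge strength, and the names `solve`, `fit_mcc`, `predict_fp`, and the toy data are all hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_mcc(features, labels, sigma=1.0, lam=1.0, n_iter=30):
    """Alternate kernel reweighting and weighted ridge regression."""
    n, d = len(features), len(features[0])
    beta = [0.0] * d
    for _ in range(n_iter):
        # Weight of sample i: Gaussian kernel of its current residual.
        resid = [
            labels[i] - sum(f * b for f, b in zip(features[i], beta))
            for i in range(n)
        ]
        w = [math.exp(-(e ** 2) / (2.0 * sigma ** 2)) for e in resid]
        # Weighted ridge normal equations: (F^T W F + lam I) beta = F^T W L.
        a = [
            [
                sum(w[i] * features[i][r] * features[i][c] for i in range(n))
                + (lam if r == c else 0.0)
                for c in range(d)
            ]
            for r in range(d)
        ]
        rhs = [
            sum(w[i] * features[i][r] * labels[i] for i in range(n))
            for r in range(d)
        ]
        beta = solve(a, rhs)
    return beta


def predict_fp(x, beta, threshold=0.5):
    """True (fault-prone) if the linear score exceeds the threshold."""
    return sum(f * b for f, b in zip(x, beta)) > threshold


# Toy data: label = 0.5 * feature, except for one grossly wrong label.
features = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
labels = [0.5, 1.0, 1.5, 2.0, 2.5, 30.0]
beta = fit_mcc(features, labels)
```

On this toy data the corrupted sixth label receives a weight close to zero after the first pass, so the fitted coefficient stays near 0.5, whereas an unweighted ridge fit would be pulled far above it. The 0.5 threshold in `predict_fp` follows the constant reported in the text.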

Pseudocode of KMP.
For ease of understanding, we present the pseudocode of the whole four-stage framework, i.e., KMP, in Algorithm 1.

Experimental Setup
This section describes the preparatory work that has to be done before testing the effectiveness of our new predictor. First, we put forward two research questions (RQs) concerning the MCC-based classifier and KMP, respectively. Second, we illustrate how performance is measured in our experiments. Last, we give an overview of the approach of He et al. [27] to SDP across projects. The rationale for selecting this method is that it has produced more acceptable results than other CPDPs [8,22,27].

Research Questions.
Our experiments use nine software projects with 32 releases in total, more details of which are shown in Table 5. It should be noted that the unique ID attribute and zero lines have been removed from the selected datasets. Additionally, we applied a "log-filter" to all numeric values $x$ in these software instances, replacing them with $\ln(x + 1)$ for data preprocessing, as suggested by Song et al. [39]. We then accumulate empirical evidence to answer the following RQs.
RQ1: Is the MCC-based classifier better to obtain prediction results compared with other commonly used learning algorithms?
First, we evaluate the performance of the MCC-based classifier for SDP and compare it with conventional classifiers such as LR, NB, RT, and SVM under the same scenario, where the training datasets are obtained using the approach of He et al. [27]. The rationale for designing the experiment in this way is that we have to control the variables in the whole study. In other words, if we directly selected the training dataset derived from the first three stages of KMP (see Sections 3.1-3.2) to test different classifiers, it might be difficult to judge whether the prediction results are improved (or deteriorated) by the classifiers or by the data preprocessing techniques (e.g., instance filtering by the k-means algorithm). Additionally, the first three stages of KMP are designed mainly for the MCC-based classifier. They may be unsuitable for training other classic classifiers and may even deteriorate their prediction performance.
RQ2: How does KMP perform in comparison to the state-of-the-art method?
Second, two main factors affect prediction accuracy in CPDP: the selected training data and the learning techniques (i.e., classifiers) applied [27]. Therefore, we investigate the performance of the proposed KMP, including both training data selection and learning algorithm modelling. The approach of He et al. [27] combined with conventional classifiers is also included for comparison.

Evaluation Measures.
Different measures have been proposed for the performance evaluation of CPDPs during the calibration, validation, and application phases of software models. In this study, we follow the suggestion of He et al. [8,27] and use the F-measure to evaluate the effectiveness of different predictors, because it integrates Recall and Precision in a single indicator. Recall and Precision are commonly used accuracy indicators:

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP},$$

where TP (i.e., true positive) refers to correctly classified buggy classes and FP (i.e., false positive) means nonbuggy classes wrongly classified as faulty, while FN (i.e., false negative) denotes buggy classes wrongly classified as nonfaulty. Considering that Precision and Recall tend to trade off against each other [8], we employ the F-measure to investigate the overall prediction performance:

$$F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$

The closer the value of the F-measure is to 1, the better the model performance is. More details of these measures can be found in [27].
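As a small worked example of these measures, the sketch below computes Precision, Recall, and the F-measure from two lists of binary labels. The function name `f_measure`, the convention of returning 0.0 when a denominator is zero, and the toy labels are assumptions made for illustration.

```python
def f_measure(actual, predicted):
    """Return (precision, recall, F) for binary labels, 1 = buggy, 0 = nonbuggy."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy release with eight modules, four of which are actually buggy.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f_value = f_measure(actual, predicted)
```

Here two buggy modules are found, one clean module is flagged by mistake, and two buggy modules are missed, so Precision is 2/3, Recall is 1/2, and the F-measure is 4/7.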

Training Data Selection.
The training datasets derived from the approach of He et al. are the most suitable training data [8] for each testing release in Table 5. We take the release ivy-1.1 as an example to explain how its most suitable training data are selected. Each combination of datasets across the Ivy project is employed as potential training data (e.g., <ant-1.3>, <camel-1.2>, <ant-1. ... He et al. [27] recommended using the top-K representative metric attributes to predict defects for all projects. The optimal K is determined by the number of occurrences of each software metric. In our experiments, the occurrences of the top-K metrics are computed by the algorithm proposed in Section 3.3.1 of He et al. [27]. There are 5 software attributes whose number of occurrences exceeds 15 out of the total of 32: CBO (22), RFC (21), LOC (20), LCOM (18), and CE (17). Therefore, these five representative software metrics constitute a general feature subset used both in Experiment 1 of Section 5 and by the method of He et al. compared with KMP in Experiment 2.

Machine Learning Algorithms Utilized for SDP.
The most suitable training datasets with the Top-5 software feature subset are used to train four different classifiers, i.e., LR, SVM, RT, and NB. The rationale for selecting these machine learning algorithms is that their efficiency has been widely verified in SDP [8,22,27]. We describe the setup of these prediction models as follows.
LR is a probabilistic statistical regression model for SDP that fits data to a logistic curve [40]. It can be applied as a binary predictor to predict a binary response. In CPDP, the labeled variable (either FP or NFP) is binary; therefore, LR is suitable for the problem. We implemented LR for software fault prediction with the "glmfit" and "glmval" functions in the Matlab environment.
SVM is a type of supervised learning model with robust learning algorithms that is typically applied to regression analysis and classification by identifying the optimal hyperplane that separates samples into two different classes. We performed SVM for SDP with the "svmtrain" and "svmclassify" functions in Matlab.
RT is also a supervised learning algorithm [41], based on the simplest hypothesis spaces. A Random Tree has two key factors: a body, which is a set of labeled instances, and a schema, which is a set of features. We implemented RT for software fault prediction with the "ClassificationTree.fit" and "predict" functions in Matlab.
NB is one of the most commonly used classifiers based on conditional probability [42][43][44]. Each feature in this classifier is assumed to be independent, which is why the classifier is termed "naïve". In practice, the Naïve Bayes classifier can provide better prediction results than more sophisticated classifiers, although the independence assumption is often violated [16,45]. The prediction model constructed by this classifier is a set of probabilities. The probability that a new instance is FP or NFP is estimated using the product of the individual conditional probabilities of the feature values for each class. We performed NB for SDP with the "NaiveBayes.fit" and "predict" functions in Matlab.
The parameters of all these learning algorithms were determined by repeating the experiments many times, and only the best classification result was selected for comparison.

Experimental Results
In this section, we present the experimental results so as to validate the efficiency and robustness of both MCC-based classifier and KMP.

Experiment 1: Is the MCC-Based Classifier Better to Obtain Prediction Results Compared with Other Commonly Used Learning Algorithms?
In this experiment, we employ the most suitable dataset for each testing software release to train ... Table 4 shows that all pairs are significantly different.

Discussion
Through a series of carefully designed experiments, the results demonstrate the effectiveness of KMP:
(i) For instance filtering, the k-means algorithm proved useful in eliminating redundant instances while keeping relevant ones. The Euclidean distance norm proved appropriate as the similarity function for clustering instances. This conclusion is consistent with the findings of Turhan et al. [22].
(ii) For feature selection, correlation analysis is effective for SDP. A feature is suitable when its correlation coefficient is statistically significant at the 95% confidence level. The features with the largest coefficient values are selected.
(iii) For instance reduction, random sampling proved useful in the experiments, which conforms to the proposition of Liu et al. [34]. An NFP/FP ratio of 65%/35% is recommended as the terminal condition.
(iv) For the MCC-based classifier, we observe that it is better than the classic models, i.e., NB, SVM, LR, and RT. Moreover, combined with the k-means algorithm and proper data preprocessing, it is also comparable to the state-of-the-art method. Additionally, its effectiveness is not greatly affected by the size of the software datasets.

Conclusion
In this paper, to combine data preprocessing and learning algorithm modelling, we design a new four-stage cross-project defect prediction model, denoted as KMP. The new method includes instance filtering, software feature selection, instance reduction, and prediction of the fault-prone class. For the stage of instance filtering, the classic k-means algorithm is employed to find the nearest-neighbor data of the given software projects in historical cross-project data. The rationale for using the k-means algorithm for training dataset selection is the principle of analogy-based learning. For the stage of feature selection, we use correlation analysis for each cluster derived from the k-means algorithm. For the third stage, we use random sampling to select training data with a proper FP/NFP ratio. For the last stage, a new MCC-based classifier is built to perform software defect prediction. We use the HQ technique to obtain the associated parameters iteratively.
To validate our approach, we systematically conduct verification experiments, with state-of-the-art methods included for comparison. The final experimental results demonstrate the efficiency of the MCC-based classifier and KMP in improving the performance of CPDP in terms of the designed research questions.
In future work, we will try to optimize the four-stage method in several ways. First, we aim to find the interrelations between instance filtering and feature selection in terms of significant factors such as the optimal number of clusters and the interval of the instance sufficiency level. Second, we plan to employ many other real-life software projects to evaluate the efficiency of the newly proposed CPDP. Finally, we will explore more efficient approaches to selecting robust training data for the MCC-based classifier.

Figure 1: The framework of the four-stage approach.

Table 2: Correlation between each software metric and bug. The selected features in the top-K are highlighted.
Input: the dataset $\mathbf{x}_i(f_1, f_2, \ldots, f_d)$, $i = 1, 2, \ldots, n_1$, in DNSP; the historical cross-project datasets $\mathbf{x}'_i(f_1, f_2, \ldots, f_d, l)$ with the known Bug $l$ and the sample size $n_2$; let $\mathbf{X} = \mathbf{x}_i \cup \mathbf{x}'_i(f_1, f_2, \ldots, f_d)$.
1: Obtain a partition $\mathbf{C} = \{C_j \mid \bigcup_{j=1}^{k} C_j = \mathbf{X}\}$ provided by the k-means algorithm.
// Feature Selection //
2: for each cluster $C_j$ do
   perform correlation analysis using the historical datasets $\mathbf{x}'_i(f_1, f_2, \ldots, f_d, l)$ belonging to $C_j$;
   select the features $(f_1, f_2, \ldots, f_m)$ with statistical significance at the 95% confidence level;
3: if $m > 6$ then
   rank $(f_1, f_2, \ldots, f_m)$ according to their correlation coefficients;
   the top-K of the ranking list act as the final selected features for $C_j$;
4: else
   the features $(f_1, f_2, \ldots, f_m)$ act as the final selected features for $C_j$;
Randomly select instances from $\mathbf{X}_j$ for each cluster $C_j$ until the NFP/FP ratio is 65%/35%.
// Train the Prediction Model with the Training Dataset and Predict //
9: for each module $\mathbf{x}_i(f_1, f_2, \ldots, f_d)$ in DNSP do
   use the training datasets $\mathbf{X}_j$ belonging to the same cluster $C_j$ as $\mathbf{x}_i$ to train the MCC-based regressor according to Eqs. (10)-(15) ...

It should be noted that the best prediction result recommended by He et al. [27] means yielding the results with the highest F-measure value (see Section 4.3).

4.3.2. Feature Selection. As shown in Table 5, each instance consists of two parts: 20 static code metric attributes and a labeled variable Bug. As suggested by He et al. [27], we transform the Bug into a binary classification in the following experiments. An instance is NFP only if its number of bugs is equal to 0; otherwise, it is FP. A defect prediction model typically categorizes each instance as either buggy or nonbuggy. Besides this data preprocessing, He et al.

Table 3: Wilcoxon rank sum test for F-measure (highlighted values are statistically significant at the 95% level).
All the F values derived from the different SDP techniques for each release are used to implement the test. Table 3 presents the results, from which we can find a statistically significant difference at the 95% level for all the ...

Table 4: Wilcoxon rank sum test for F-measure (highlighted values are statistically significant at the 95% level). RT, SVM, LR, NB, and MCC mean that these classifiers were trained by the training data derived from He et al.'s method. KMP denotes that the MCC-based classifier was trained by the framework proposed in Section 3.
Based on the step-by-step procedure shown in Sections 3.1-3.3, we can obtain the prediction results of all the other modules in the 32 software releases. The results measured by the F-measure are shown in Figure 3. From Figure 3, we can see that (1) KMP is superior to He et al.'s method in F-measure, because it has produced the highest median value and the highest minimum value, and (2) the MCC-based classifier, combined with k-means, correlation analysis, and random sampling, can provide more robust prediction results than the one in coordination with He et al.'s method, since it has the narrower box in Figure 3. Additionally, we performed the Wilcoxon rank sum test to see whether He et al.'s method differs significantly from KMP. The F-measure values provided by the different prediction techniques for each release were used to perform the tests.

Table 5: Descriptions of the selected datasets.

Table 6: Descriptions of software metrics.

Table 7: The results for the Kolmogorov-Smirnov test. The logical value h=1 if the test rejects the null hypothesis that the data of interest follow a normal distribution at the 0.05 significance level, and h=0 if it cannot.