An Empirical Study towards dealing with Noise and Class Imbalance issues in Software Defect Prediction


The quality of defect datasets is a critical issue in software defect prediction (SDP). These datasets are obtained by mining software repositories. Recent studies have raised concerns about the quality of defect datasets, owing to inconsistencies between the bug/clean fix keywords in fault reports and the corresponding links in change management logs. The class imbalance (CI) problem is another major challenge for SDP models. A defect prediction method trained on noisy and imbalanced data produces inconsistent and unsatisfactory results, so a combined analysis of noisy instances and the CI problem is required. To the best of our knowledge, few studies have addressed both aspects together. In this paper, we examine the impact of noise and the CI problem on five baseline SDP models; we manually added noise at various levels (0% to 80%) and measured its impact on the performance of those models. We further provide guidelines on the range of noise that the baseline models can tolerate. We also suggest an SDP model that has the highest noise tolerance and outperforms the other classical methods. The True Positive Rate (TPR) and False Positive Rate (FPR) values of the baseline models reduce by 20% to 30% after adding 10% to 40% noisy instances. Similarly, the ROC (Receiver Operating Characteristics) values of the SDP models reduce to between 40% and 50%. The suggested model tolerates noise between 40% and 60%, which is more than the other traditional models.


Introduction
Software defect prediction (SDP) [1,2,3,4,5] attempts to identify the most likely fault-prone modules in a software project by utilizing software metrics [6,7,8]. It is advisable to test fault-prone modules carefully and thoroughly rather than treating all modules in the same manner. SDP models make use of bug reports from earlier versions of the software, which indicate faulty and non-faulty modules. The metric information of the modules of a software system [9] is used to train defect prediction models. SDP models may also make use of the change-log in the software configuration management documentation, since the change-log records the modules that were changed to correct detected faults. We refer to SDP models that use well-known traditional classifiers to predict buggy or clean modules as classical or traditional SDP models. All the classifiers that we have used for experimental purposes are classical SDP models. These SDP models and their variants are widely applied in the SDP domain, so they also serve as baseline methods in defect prediction. We have used five baseline methods in our experiments, discussed in section 4.3.2. The links between logs and bug reports may be inconsistent for several reasons [10], which may result in mislabeled data. Therefore, an SDP model is quite likely to be working with noisy data, which leads to erroneous results. When the cardinality of one class is much smaller than that of the other class, the dataset is said to be imbalanced. [11] reported a combined study of noise and class imbalance (CI) problems in software quality. They conducted experiments with eleven classification techniques and seven sampling methods over public datasets. They concluded that a few classifiers combined with sampling methods cope best with noisy and imbalanced datasets.
To the best of our knowledge, the combined interaction between CI and noisy instances has received limited research attention. Studying CI and noisy instances in isolation has the following limitations: (a) It is not easy to determine the concurrent impact of both challenges on defect prediction models, although both problems degrade the performance of SDP techniques. (b) Studies of noisy instances alone only help in suggesting the percentage of noisy instances a model can tolerate, whereas studies of the CI problem alone only help in recommending the rate of imbalance in the dataset that a predictive model can handle. (c) A common approach that overcomes both challenges together cannot be proposed. (d) The trade-off between the percentage of noisy instances and the CI rate cannot be explored.
After our empirical study, we list below a few compelling motivations for a combined analysis of noise and the CI problem in software defect prediction.
(i) Many software practitioners and researchers have questioned the quality of defect prediction datasets; Shepperd et al. [12] reported that many instances in the NASA data repository were noisy. A joint analysis of the CI problem and noise can help to study the relative impact of classifiers and sampling methods on noisy instances, as only limited studies have reported on the interaction of sampling and classifiers over noisy data. (ii) How do classifiers interact with sampling methods? Do certain sampling methods perform better when used together with specific classification algorithms over noisy datasets? (iii) A combined study of sampling methods and classifiers, and an analysis of their performance across various SDP models at different noise levels, is still unexplored.
Researchers have suggested SDP models that deal with either the CI problem or noisy instances, but not both; we propose an SDP model that addresses both issues. Apart from this, we analyzed the noise tolerance of existing SDP models, i.e., the amount of added noise up to which the performance of an SDP model remains unchanged. To structure the study, we framed five research queries (RQs), based on evaluation metrics, that guide the proposition of a model. These RQs justify the observations of our empirical study. The research queries are as follows.
RQ-1: What are the effects of noise on True positive rate (TPR) and False positive rate (FPR) over classical SDP models?
RQ-2: To what extent is the suggested model resistant to various levels of noise compared to the other classical SDP models?
RQ-3: What is the range of tolerable noise in baseline defect prediction models?
RQ-4: How does the class imbalance problem affect the performance of various SDP models over different noise levels?
RQ-5: How does the performance of the proposed approach compare with that of other classical SDP models when no sampling method is applied?
All five RQs explore the circumstances under which classical SDP models work over noisy data. Noise tolerance in defect prediction still has scope for research. We have conducted noise-handling experiments similar to those in [13]; apart from this, we also dealt with CI issues. [14] also explored the challenges of mislabeled data, which leads to inconsistent results. [15] suggested an approach, JIT-SDP, that makes defect predictions at the software change level and presumes that the characteristics of the problem remain constant over time. The article makes the following contributions.
(i) A combined empirical study of noise and the CI problem. The article also evaluates the impact of these two problems on the performance of baseline methods. (ii) An analysis of the levels of noise and class imbalance that the baseline methods can tolerate. (iii) A suggested SDP model that tolerates the maximum degree of noise and circumvents class imbalance issues; these two are the most prominent challenges in any SDP technique. The suggested approach is mainly a change buggy prediction model. We have tested the significance of the suggested approach using TPR, FPR, F-measure, Precision, and ROC in comparison with other traditional SDP models. (iv) We conducted 864 experiments over 3 public datasets using 5 classifiers and 1 sampling method. We also provide a few guidelines on noise tolerance levels and the CI issue in baseline SDP models. These guidelines will assist in building better SDP models in the future.
In section 2, we discuss the related work, followed by the background details in section 3; we then describe the experimental procedure and the suggested approach in section 4. We analyze and discuss the results of our experiments in section 5. Afterward, we discuss threats to validity in section 6. In the final section 7, we present the conclusions drawn from the article.

Related work
Given a software module described by software features, e.g., source code metrics, SDP classifies the module as either clean or buggy. A few existing SDP methods are SVM [16,17], Naive Bayes [18,19], Random forest [20,21], AdaBoost [22,23], J48 [24,25], etc. Recently, a few ensemble learning [23,26,27] and deep learning-based [28,29,30] defect prediction architectures have also been reported. [31] suggested an interesting approach based on dimension reduction of different software metrics, using a tangent-based SVM. Several unsupervised and semi-supervised machine learning methods have also been applied in the SDP domain. Abaei et al. [32] proposed a semi-supervised approach using a hybrid self-organizing map (HySOM); the model is able to predict defect-prone modules in an unsupervised manner.
They performed experiments using the NASA dataset and found an improvement over existing methods. [33] performed a comparative analysis of the performance of various semi-supervised methods. Lu et al. [34] proposed a semi-supervised method and found it significantly better than random forest. Catal et al. [35] proposed a metric-driven software quality prediction model; their method can be used when defect data are absent, and it does not require the number of clusters to be known before the clustering phase. They found that the proposed model significantly outperforms existing methods. [36] proposed an SDP model called ACo-Fores, which addresses the problem of inadequate availability of historical data. They used the PROMISE dataset for experiments and obtained better results than other state-of-the-art methods. Similar work was performed by [37], who investigated the Expectation-Maximization (EM) algorithm for software quality prediction. They used the NASA dataset and found that the EM-based prediction model improves generalization performance. SDP techniques suffer from two main challenges, so we have categorized the related work accordingly: the first category concerns the quality of the data, and the second the class imbalance issue.

Quality of defect dataset
In a few studies [5,38], researchers have claimed that large datasets contain errors such as field errors; the rate of field errors is around 5% [39,40]. [41] tried to handle noisy data and also to overcome the over-fitting [42] problem. [12] also questioned the quality of the data in the NASA repository; nevertheless, a few powerful machine learning-based defect prediction models are available, such as [19,43,44,45]. There are two types of noise in a defect dataset, class noise and feature noise, and both [46] affect the performance of machine learning algorithms. In this article, we consider only class noise. Class noise is the interchange of a class label from clean to buggy, from buggy to clean, or both, for any reason. This problem leads to inconsistent results. [47] concluded that a few defects are not recorded in the commit logs of a dataset and hence are not visible to automated linking tools. [48] found that more experienced developers are more likely to explicitly link issue reports to code changes. [49] investigated the behavior of SDP models on artificially generated defect datasets. Catal et al. [50] conducted a study on class noise detection; they proposed a detection algorithm based on software feature threshold values. Riazz et al. [51] proposed a two-stage data preprocessing method that incorporates feature selection and noise filtering; they employed K-nearest neighbor and ensemble learning in their approach. Alan et al. [52] proposed an outlier detection approach using metric thresholds and class labels; they employed NASA datasets to identify class outliers and found that the proposed model outperforms baseline methods.

Class imbalance problem
The class imbalance issue may bias results towards the negative instances [53]. Song et al. [54] recently conducted a comprehensive study of class imbalance in SDP. A few studies [55,56] compared the results produced from imbalanced and balanced class labels, and a few researchers have proposed solutions that improve the accuracy of SDP models. Researchers [57,58,59] have proposed random subsampling [60], SMOTE [61], class balancer [62], and spread subsampling [63] techniques, which help to avoid the class imbalance issue and provide unbiased results. Joon et al. [64] performed a combined study of class imbalance, feature selection, and a simple noise removal strategy over public datasets; they used precision, recall, F-score, ROC, and accuracy as performance measures.

Background
The general framework of a software defect prediction model is shown in Fig. 1. A software repository consists of two segments [65]: a version control system (VCS) and an issue tracking system (ITS), as shown in Fig. 1. Software practitioners mostly use both, because version control systems (source code repository systems) are unable to store bugs. As the figure shows, instances are generated from the software repository. These instances are made up of software metrics; data cleaning and other preprocessing are required to build the training set. The training set is then fed into the model, which classifies modules as buggy or clean. We discuss preprocessing and the trained model in detail in later sections. Before designing any prediction model, we need to create the prediction target, i.e., the class label. Software modules consist of software entities such as a file [66], component [67], or change [68]; an SDP model is intended to predict a software module as either buggy/defective or clean/non-defective. There are two types of defect prediction models: buggy file prediction and change buggy prediction. They are described in detail below.
Fig. 1: General framework of software defect prediction model.

Buggy files prediction
Identifying buggy files in advance helps the development team leader to allocate resources properly and optimally, which minimizes the testing effort. Some internal properties of a software system, such as software metrics, are associated with external properties such as the fault-proneness of a module. This kind of SDP model relies on software features that are expressed in a defect dataset. The classification model learns from historical data and predicts the fault-prone modules in test data. Many software features are used in this kind of SDP model, such as resource metrics [69], process metrics [70], and cyclomatic complexity metrics [71].

Change buggy prediction
When new changes are introduced in a software module, change buggy prediction predicts whether the changed module is buggy or not; it learns from change classification (CC). Suppose a module consists of n files and a new file is added to it, so that the module now contains n+1 files. The addition of this file may make the software faulty. Change classification mainly involves two source code revisions, an old revision and a new revision. A change across several files is associated with metadata, which includes the author, change-log, date of commit, etc. Mining the change history yields the co-change count, which indicates for how many file changes the system remains clean or becomes buggy. [72] illustrated this process in their article.
Building either of the two types of SDP models defined above requires class labels (buggy/clean) and various features. Fitting a model on mislabeled data may cause incorrect results. In this direction, we propose a change buggy prediction model built on public data. In the next section, we discuss the experimental details and performance measures, and build an SDP model that can tolerate noisy instances to some extent.

Experimental procedure and suggested approach
In this section, we will illustrate the experimental details, dataset description, noise addition phenomenon, performance metrics, preprocessing, classification techniques, and the suggested approach.

Dataset description
We have used three public datasets, Columba, Scarab, and Eclipse, in our experiments for the buggy change prediction model; their details are shown in Table 1. These are classical datasets and, compared with other open-source datasets, have a substantial number of training instances (see Table 1), which leads to satisfactory results. [13] used these datasets and conducted similar experiments by manually adding noise to them, so we are extending their experiments.

Noise added in dataset
We assumed that the datasets we use are pure, i.e., that they contain no noisy instances. Kim et al. [13] made the same assumption for these three datasets. We therefore injected a given percentage of noisy labels into each dataset and then trained various SDP models on it. Class noise is introduced by interchanging class labels from buggy to clean and from clean to buggy. We then evaluated the performance of the SDP models at different noise levels. To analyze the performance of the classical SDP models, we added noise to the defect data from 0% to 80%. Here 0% means that no noise is added, whereas 10% means that 10% of the total instances are selected and their target labels are interchanged.
We evaluate the models using the following performance measures, where TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively.
(a) True positive rate (TPR): It is the proportion of actual positive instances that are correctly classified; it is also known as Recall. The higher the TPR, the better the model.
$$TPR = \frac{TP}{TP + FN}$$
(b) False positive rate (FPR): It is the ratio of the number of negative instances wrongly classified as positive (false positives) to the total number of actual negative instances (regardless of classification). The lower the FPR, the more effective the model, and vice versa.
$$FPR = \frac{FP}{FP + TN}$$
(c) Precision: It is the ratio of relevant instances among the instances classified as positive, i.e., the proportion of instances predicted as positive that are actually positive.
$$Precision = \frac{TP}{TP + FP}$$
(d) F-measure: It is the harmonic mean of precision and recall (TPR). Its value lies between 0 and 1; 0 implies the worst result and 1 implies the best result. It is also known as F-score and F1 score.
$$F\text{-}measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$
(e) Receiver Operating Characteristics (ROC) curve: The ROC value is the area under the curve obtained by plotting TPR against FPR. Its value lies between 0 and 1. The value 0 shows that there is no correct classification, 0.5 shows random classification, and 1 shows that 100% of the instances are correctly classified. It mainly reflects the diagnostic ability of a binary classifier [74,75,50].
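As a minimal sketch of how TPR, FPR, precision, and F-measure follow from confusion-matrix counts, the snippet below uses the standard definitions of these measures. The function name and the counts in the example are hypothetical and are not taken from the experiments.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return TPR, FPR, precision and F-measure from confusion-matrix counts."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0        # recall
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + tpr
    # Harmonic mean of precision and recall; 0.0 when both are zero.
    f_measure = 2 * precision * tpr / denom if denom else 0.0
    return {"TPR": tpr, "FPR": fpr, "precision": precision, "F-measure": f_measure}


# Hypothetical counts, for illustration only.
m = classification_metrics(tp=80, fp=10, tn=90, fn=20)
print(m)
```

The zero-denominator guards are a design choice of this sketch; they only matter for degenerate cases such as a classifier that never predicts the positive class.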

SDP models for experiments
We added noise at levels from 0% to 80% and then trained various models, analyzing the performance of the defect prediction models at each noise level. The other major challenge in defect datasets is the skewed distribution of classes, which causes the class imbalance problem. To deal with this problem, a few researchers have applied sampling methods that establish a balance between the positive and negative classes.
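The noise-addition step can be sketched as follows, assuming binary labels (1 = buggy, 0 = clean) and uniformly random selection of the instances whose labels are interchanged. The function name, the seed handling, and the example label list are hypothetical; the selection procedure actually used in the experiments may differ.

```python
import random


def inject_class_noise(labels, noise_pct, seed=0):
    """Flip the binary label of noise_pct percent of the instances."""
    rng = random.Random(seed)
    n_flip = round(len(labels) * noise_pct / 100)
    # Choose distinct instances uniformly at random (an assumption of this sketch).
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


# Hypothetical dataset: 30 buggy and 70 clean instances.
labels = [1] * 30 + [0] * 70
noisy = inject_class_noise(labels, noise_pct=10)
changed = sum(a != b for a, b in zip(labels, noisy))
print(changed)
```

Because instances are drawn from the whole dataset, labels change in both directions (buggy to clean and clean to buggy), in proportion to the class sizes.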
In the next sections, we explain the preprocessing methods and then various baseline models.

Preprocessing Techniques
In the preprocessing stage, after data cleaning, feature selection and sampling are the major steps. The steps involved in preprocessing are as follows.
1. Sampling technique: As Table 1 reports, the datasets have a skewed distribution and suffer from class imbalance. To avoid the CI problem, we applied random undersampling [76]. We also tried other sampling methods, e.g., class balancer [77], synthetic minority oversampling technique [61], and spread subsample [78], but achieved the best results with random undersampling. We adopted the algorithm from [79], which deletes random samples of the majority class label (SetLabels); its full description is given in [79]. The main objective of this algorithm is to achieve a uniform distribution of buggy and non-buggy class labels.
2. Feature selection: As Table 1 reports, each dataset has a massive number of features, so we need to select the relevant features for better analysis. We used Information gain [80] as the feature selection method and the Ranker method as the search technique [81,82] for feature ranking. Information gain is an entropy-based method; it is defined as the amount of information that a selected item provides for categorization, and it measures how important the information carried by an item is for the classification problem. The Ranker method is used in conjunction with an attribute evaluator (Entropy, Gain ratio, etc.). It has three parameters, P, T, and N. P (start state) specifies a starting set of attributes; the specified attributes are ignored during ranking. T (threshold) specifies the threshold by which features are discarded. N (number of selections) specifies the number of attributes to select.
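The two preprocessing steps above can be illustrated with a small sketch: random undersampling of the majority class, followed by ranking features by information gain and keeping the top N. This is an illustration under stated assumptions, not the algorithm of [79] or the tool implementation used in the experiments: feature values are assumed to be discrete (numeric metrics would need discretizing first), and all function names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Randomly delete majority-class rows until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    keep = sorted(i for idx in by_class.values() for i in rng.sample(idx, n_min))
    return [rows[i] for i in keep], [labels[i] for i in keep]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """H(class) - H(class | feature) for one discrete feature."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows, labels, top_n):
    """Rank feature indices by information gain and keep the top_n best."""
    gains = [(information_gain([r[j] for r in rows], labels), j)
             for j in range(len(rows[0]))]
    gains.sort(key=lambda g: (-g[0], g[1]))
    return [j for _, j in gains[:top_n]]


# Toy data: feature 0 separates the classes, feature 1 is constant.
rows = [(1, 0), (1, 0), (0, 0), (0, 0), (0, 0), (0, 0)]
labels = [1, 1, 0, 0, 0, 0]
bal_rows, bal_labels = random_undersample(rows, labels)
print(Counter(bal_labels), rank_features(bal_rows, bal_labels, top_n=1))
```

Here `top_n` plays the role of the N parameter described above; a threshold on the gain (the T parameter) could be added in the same place.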

Classification techniques
We have used five different classification techniques while applying the same preprocessing methods (discussed in section 4.3.1). These classifiers build five different SDP models. The list of classification techniques is shown below.
(i) Naive Bayes (NB): It is a probabilistic classifier [83,84] derived from Bayes' theorem. It is a family of algorithms that share a common principle: every pair of features is assumed to be independent of each other. The underlying assumption is that features make an equal and independent contribution to the outcome [85]. We used batch size = 100 and set "doNotCheckCapabilities" and "kernelEstimator" to "False".
(ii) Least Square Support Vector Machine (LSVM): It is a supervised learning algorithm [86] that can be used for both classification and regression problems. It is a binary classifier that creates a hyperplane in n-dimensional space to classify the instances [16]. We used the radial basis function kernel in our experiments, with batch size = 100, cache size = 40, cost = 1, degree = 3, loss = 0.1, nu = 0.5, and seed = 1.
(iii) J48: It is a variation [87] of the C4.5 algorithm, a decision tree-based classification algorithm that is used to create Univariate Decision Trees (UDT) [43]. It calculates the information gain of each attribute and selects the attribute with the maximum information gain; the leaf node decides the category to which an instance belongs. We used batch size = 100 and set "binarySplits" to "False", "collapseTree" to "True", no. of folds = 3, seed = 1, "unpruned" to "False", and "useLaplace" to "False".
(iv) AdaBoost: It is short for Adaptive Boosting [44,22], an ensemble learning technique. It combines different weak learners into one model and combines the results of each weak learner, which makes the classifier more powerful. As it is an ensemble learning technique, it helps to overcome the over-fitting problem.
(v) Random Forest (RF): It is also an ensemble learning technique [88]. Algorithm 1 shows the pseudocode of the RF learning algorithm, which we adopted from [88]; a complete discussion of RF can be found in [88]. The RF algorithm contains a function called "RandomizedTreeLearn", which returns the learned tree: at every node, it takes a small subset f of the feature set F and splits on the best feature in f. RF is a decision tree-based learning algorithm and one of the most robust SDP models [89,90]. We used batch size = 100, "breakTiesRandomly" as "False", "computeAttributeImportance" as "False", "no. of slots" = 1, and seed = 1.
Before applying the learning technique, we split each dataset into a training set (70%) and a testing set (30%); we also tried other split ratios but obtained the best results with the 70%-30% ratio. We then used ten-fold cross-validation [91] on the training set for each classifier, which reduces the possibility of over-fitting [92] in the classification model.
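The 70%-30% split followed by ten-fold cross-validation on the training portion can be sketched on instance indices alone. The snippet below is an illustrative assumption about how such a split might be organized (random shuffle, no stratification); the helper names are hypothetical.

```python
import random


def train_test_split(indices, train_frac=0.7, seed=0):
    """Shuffle the indices and cut them into a training and a testing part."""
    rng = random.Random(seed)
    shuffled = indices[:]
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_frac))
    return shuffled[:cut], shuffled[cut:]


def k_fold(indices, k=10):
    """Yield (fit, validate) index lists; every index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, folds[i]


# 100 hypothetical instances: 70 for training, 30 held out for testing.
train, test = train_test_split(list(range(100)))
splits = list(k_fold(train, k=10))
print(len(train), len(test), len(splits))
```

Cross-validation is run only on the training indices, so the 30% test portion stays untouched until the final evaluation.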

Suggested approach
The sequence of experimental procedures is given in Algo. 2. The underlying architecture of the suggested approach is shown in Fig. 2; it reflects each phase of the suggested model. Noise is added by mislabeling the class label, as shown in Fig. 2. We injected various noise levels into each dataset and tested the noise level that the change buggy prediction model can endure. We applied information gain as the attribute selection method and the Ranker method as the search method to rank the attributes and select the most relevant ones (see section 4.3.1). We used random undersampling as the sampling technique to address the class imbalance problem. We also tried a few other well-known sampling techniques (SMOTE, Class Balancer, Spread Sub-Sampling, etc.), but the best results came from random undersampling. After preprocessing, we split the dataset into a training set (70%) and a testing set (30%) (see Fig. 2). We then applied ten-fold cross-validation on the training data. Cross-validation [91] also helps to avoid over-fitting [92] and yields a better prediction model. To avoid random bias, each experiment was performed ten times, and we took the mean value of each performance measure. Random Forest is applied as the classifier, as shown in Fig. 2. We performed the same preprocessing steps for the pure set, i.e., at 0% noise, and compared the performances P1 and P2, as shown in Algorithm 2; here P1 is the performance at the 10% to 80% noise levels and P2 is the performance at the 0% noise level.
[Algorithm 2: the classifier is applied after adding noise and the performance measures are recorded; the classifier is also applied without adding noise (TPR, FPR, Precision, F-measure, and ROC); finally, P1 and P2 are compared, i.e., the performance of the models with and without added noise.]
In the next section, we will discuss the performance of various SDP models after adding different noise levels from 0% to 80%. Besides, we also test the tolerable noise in the suggested architecture without applying the sampling technique for all five baseline methods.

Results and analysis
We experimented with and without noisy instances in the datasets; noise was added from 10% to 80% in all three datasets. We also conducted experiments with and without the sampling method at various noise levels and evaluated the performance. In this section, we address all five research queries (see section 1) and justify the conclusions drawn from the experimental results. The justification for each RQ is given below.

What are the effects of noise on True positive rate (TPR) and False positive rate (FPR) over classical SDP models?
Table 2 and Fig. 3 report the TPR of all baseline models at the various noise levels, whereas Table 3 and Fig. 4 report the FPR of the SDP models at different noise levels. On Eclipse data, the RF-based model outperforms the other defect prediction models: on pure data its TPR value is 0.975, whereas the lowest TPR value produced by the proposed model is 0.844, at 60% and 80% noise. The worst TPR on pure Eclipse data, 0.854, is produced by the NB-based model. AdaBoost was the most variable model as the noise level increased from 0% to 80%; its TPR values decreased from 0.9 to 0.489. The FPR values of all five defect prediction models are shown in Table 3. The lowermost FPR value for pure Columba data (see Fig. 4) ... Fig. 3(b) shows that the RF-based SDP model deviates the least, i.e., its TPR value is 0.893 at 0% noise and 0.877 at 80% noise, which are close to each other, whereas the TPR values of the other models fluctuate. We can see in Fig. 3(a) that the RF-based model has a TPR range from 0.975 (0% noise) to 0.844 (80% noise), although up to the 20% noise level its value is 0.923, which is close to 0.975. The proposed model has the highest tolerance on Scarab data, as shown in Fig. 3(c) and Table 2: at 0% and 70% noise the TPR values are 0.890 and 0.878, respectively, which are approximately equal. No other method is able to tolerate this amount of noise; the others show inconsistent results. FPR and precision are useful metrics for measuring the efficiency of SDP models. The deviation of the FPR values at various noise levels is shown in Fig. 4. The FPR value for Columba data at 30% noise is 0.180, which is close to 0.209 at 0% noise; this implies that the model can tolerate noise up to 30%, as shown in Fig. 4(b).
The FPR values for Eclipse and Scarab data produced by the proposed model at 0% noise are 0.458 and 0.111, respectively, as shown in Table 3. The FPR curve for Eclipse data produced by the RF-based model has the least deviation, as shown in Fig. 4(a). The FPR curves for Scarab data are shown in Fig. 4(c): the RF-based model has the least deviation, whereas the NB-based model has the most deviated curve. The FPR values for Scarab data produced by the proposed model at 0% and 40% noise are 0.11 and 0.121, respectively, which are close to each other; this implies a tolerable deviation up to 40% noise.
The precision values of the classical SDP models at various noise levels for all three datasets are shown in Table 5. As Fig. 5(b) shows, the precision of the RF-based model at 0% and 40% noise is 0.894 and 0.875, respectively. These two values are close to each other, which implies that the model tolerates noise up to 40%. The precision values for pure Eclipse and Scarab data produced by the suggested model are 0.970 and 0.890, respectively. The precision of the RF-based model on Eclipse data at 20% noise is 0.923, which is close to 0.970 at 0%; this indicates that the model's performance remained largely unchanged up to 20% noise. The precision value of Scarab data at 80% ... The J48-based model also has a high noise tolerance for Scarab data: its precision is 0.854 at the 50% noise level, which is close to 0.870 at 0% noise.
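The "close to each other" reasoning used above can be made explicit with a small sketch that returns the highest noise level up to which a measure stays within a chosen tolerance of its value at 0% noise. The function name and the tolerance of 0.06 are hypothetical choices for illustration; the text does not state a numeric threshold. The example uses only the TPR values of the RF-based model on Eclipse data that are quoted in this section.

```python
def tolerable_noise(metric_by_noise, tolerance):
    """Largest noise level up to which the metric stays within `tolerance`
    of its value at 0% noise (levels are checked in increasing order)."""
    baseline = metric_by_noise[0]
    tolerated = 0
    for level in sorted(metric_by_noise):
        if abs(metric_by_noise[level] - baseline) > tolerance:
            break
        tolerated = level
    return tolerated


# TPR of the RF-based model on Eclipse at the noise levels quoted in the text.
rf_tpr_eclipse = {0: 0.975, 20: 0.923, 60: 0.844, 80: 0.844}
print(tolerable_noise(rf_tpr_eclipse, tolerance=0.06))
```

With this hypothetical tolerance the sketch reports 20% for the quoted values; a different tolerance would move the breakdown point, so the threshold should be fixed before comparing models.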

What is the range of tolerable noise in baseline defect prediction models?
After performing the experiments, we conclude that the range of tolerable noise differs across the five SDP models. Fig. 3 to Fig. 7 show the TPR, FPR, Precision, F-score, and ROC, respectively, of the defect prediction models under different noise conditions. We analyzed each classifier in turn. The tolerable noise range of the NB-based model is from 20% to 30%, because its TPR (see Table 2 & Fig. 3) and FPR (Table 3 & Fig. 4) values remain unchanged up to the 30% noise level; this indicates that the model is stable and tolerant up to 30% noise. The precision (Fig. 5) and ROC (Fig. 7) of the NB-based SDP model fall for every dataset once more than 30% noise is added. The F-measure (Fig. 6) and TPR (Fig. 3) values fall continuously, but up to the 30% noise level the TPR and F-score values are approximately equal, which indicates a performance breakdown point at 30% noise. The FPR values (Fig. 4) increase as the noise level rises; still, from 0% to 30% the FPR values are close to each other, which indicates that the model performs uniformly up to 30% noise, after which it starts misclassifying the actual class and its performance degrades. SVM is also an effective SDP model; Fig. 5 and Fig. 7 show its precision and ROC values, respectively, from which the effectiveness of SVM on every dataset and its deviation under different noise conditions can be seen. Precision falls on the Eclipse and Columba datasets as noise is added, but for the Scarab data the precision first increases gradually and then starts decreasing. ROC rises in the early phase for both the Eclipse and Scarab datasets. The ROC values produced by the SVM defect prediction model on the Eclipse dataset increase up to 70% noise and then start decreasing, which indicates that the SVM-based model is highly noise tolerant on Eclipse data. For the Scarab data, the ROC values of the SVM-based model degrade after the 10% noise level, as shown in Fig. 7.
This indicates that the SVM model is not stable over the Scarab dataset. For the Columba data, the 20% noise level is significant, which means that there are no sharp changes in ROC values. The SVM-based method tolerates noise up to 40% for the Eclipse data. The TPR values fall continuously for all three datasets, as shown in Fig. 3, whereas the f-score values change erratically for the Columba and Scarab datasets and decrease at every noise level for the Eclipse data, as reported in Fig. 6. The FPR values (Fig. 4) for the Eclipse and Columba datasets decrease gradually as noise increases, but for the Scarab data they increase up to 70% noise and then drop suddenly. This indicates that SVM is stable and more tolerant over the Scarab data. The J48 algorithm performs uniformly when the noise level is between 30% and 40%, because its TPR, F-score, ROC, and precision values are approximately equal for all three datasets. The FPR suddenly starts decreasing when more noise is added to the Eclipse data, as we can see in Fig. 4(c); this indicates that the J48-based model cannot tolerate more than 30% noise in the Eclipse data. Elsewhere, the FPR values increase as the noise level increases, which makes the results unpredictable.
The AdaBoost-based SDP model is the least effective on every dataset as noise increases, as shown in Fig. 6 and Fig. 7. When the noise level is increased to 10%, the TPR and precision start decreasing for all three datasets. The FPR values even decrease for the Eclipse data, as shown in Fig. 4. For the Columba and Scarab datasets, however, the FPR values increase with the noise level, which implies that AdaBoost cannot tolerate noise well on these datasets; AdaBoost tolerates noise up to at most 30%. In our suggested approach, i.e., the RF-based SDP model, the ROC and precision remain almost unchanged up to 60% to 70% noise, as shown in Fig. 7 and Fig. 5. The f-score and TPR values decrease as the noise level increases, but from 40% to 60% noise the TPR and f-score are unaffected. The FPR in Fig. 4(a) decreases as noise increases, whereas in Fig. 4(c) and Fig. 4(b) the FPR increases with noise.
For all three datasets, the noise tolerance capacity of the proposed approach is 30% to 40%. Table 4 reports the f-score of all methodologies at various noise levels. The maximum f-scores produced by the proposed model for the pure Columba, Eclipse, and Scarab datasets are 0.889, 0.968, and 0.889, respectively. Fig. 6 shows the f-score deviation curves of the SDP models over different noise levels. We can see in Fig. 6 [...] Fig. 6(a). As Table 4 and Fig. 6(c) show, the highest f-score for the pure Eclipse data, 0.974, is produced by the NB-based SDP model, followed by RF with an f-score of 0.968. The lowest f-score for the Eclipse data is 0.870, produced by the AdaBoost-based SDP model. The proposed model and the J48-based SDP model have the least deviated curves, whereas the NB-based SDP technique has the most deviated curve, as shown in Fig. 6(a) [...] a well-fitted model that outperforms at high noise levels.
Table 4 also lists the f-score on the Scarab data for each methodology under different noise conditions. The maximum f-score for the Scarab data is 0.889, produced by the RF-based SDP model, followed by the J48-based model with an f-score of 0.867. Fig. 6(c) shows the deviation curves of all five SDP methods at the various noise levels. RF and J48 give the most consistent results in every noise situation, whereas SVM has the most deviated curve. The f-score values of the RF-based model at 10%, 20%, 30%, 40%, 50%, 60%, 70%, and 80% noise are 0.889, 0.882, 0.895, 0.885, 0.884, 0.870, 0.878, and 0.883, respectively. The SVM-, NB-, and AdaBoost-based defect prediction models are not effective at high noise levels. Table 6 and Fig. 7 report the ROC of all five models under various noise levels. The maximum ROC value for the pure Columba data is 0.951, produced by the RF-based model, as shown in Fig. 7(b). The J48-based SDP model follows with a ROC value of 0.843. The lowest ROC value, 0.58, is produced by the SVM-based method. The NB- and AdaBoost-based SDP models perform moderately, with ROC values of 0.717 and 0.748, respectively. The variation of ROC values for the Eclipse data is shown in Fig. 7(a). The maximum and minimum ROC values for the pure Eclipse dataset are 0.967 (RF) and 0.566 (SVM), respectively. In Table 6 and Fig. 7(c), the maximum ROC value at 0% noise is 0.960, produced by the proposed model, followed by the J48-based model with a ROC value of 0.889. The ROC curve generated by the RF-based SDP model is almost uniform, with the least deviation among the methodologies, which indicates that RF outperforms every other SDP model at the various noise levels.
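The notion of a "least deviated curve" can be made concrete by summarizing one per-noise-level series. The following is a minimal sketch, not the authors' procedure: it takes the RF f-scores on Scarab quoted above (with 0.889 as the 0% value) and computes the extremes, the total range, and the interquartile range. The helper name `spread` and the use of the standard library's default quartile method are assumptions made for illustration.

```python
from statistics import quantiles

# RF f-score on Scarab at 0%..80% label noise, as quoted in the text above.
noise_levels = [0, 10, 20, 30, 40, 50, 60, 70, 80]
rf_fscore = [0.889, 0.889, 0.882, 0.895, 0.885, 0.884, 0.870, 0.878, 0.883]

def spread(levels, values):
    """Return the extremes (value, noise level), total range, and interquartile range."""
    lowest = min(zip(values, levels))
    highest = max(zip(values, levels))
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points, default method
    return {
        "min": lowest,
        "max": highest,
        "range": round(highest[0] - lowest[0], 3),
        "iqr": round(q3 - q1, 3),
    }

summary = spread(noise_levels, rf_fscore)
print(summary)
```

A narrow range and interquartile range correspond to a flat curve; applying the same helper to another model's series would allow a like-for-like comparison of stability.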
The boxplot ranges of the performance measures of the SDP models over the 0% to 80% noise levels are shown in Fig. 8 to Fig. 12. Fig. 8 reports the boxplot range of TPR values at different noise levels for all three datasets. We can see in Fig. 8(a) [...] to 0.910 (0% noise), for J48 0.813 (50% noise) to 0.930 (0% noise), for RF 0.844 (60% noise) to 0.975 (0% noise), and for AdaBoost 0.489 (60% noise) to 0.9 (0% noise). Fig. 8(b) shows the boxplot range of TPR for the SDP models, with the corresponding noise levels, for the Eclipse dataset. As Fig. 8 reports, the range of the NB-based SDP model is from 0.612 (40% noise) to 0.854 (0% noise), that of the SVM-based model from 0.618 (50% noise) to 0.910 (0% noise), that of the J48-based model from 0.813 (50% noise) to 0.930 (0% noise), that of the RF-based defect prediction model from 0.844 (60% noise) to 0.975 (0% noise), and that of AdaBoost from 0.489 (60% noise) to 0.9 (0% noise). This indicates that most of the models perform best at 0% noise and that performance degrades as noise increases. The quartile range of TPR values produced by AdaBoost over the Eclipse data is the largest, as shown in Fig. 8(b), which indicates that the model is unstable and misclassifies instances. Fig. 8(c) shows the boxplot range of TPR for the Scarab dataset for the five SDP models at various noise levels. The TPR range of the NB-based SDP model is from 0.616 (60% noise) to 0.732 (10% noise), that of the SVM-based SDP model from 0.714 (70% noise) to 0.852 (10% noise), that of the J48-based model from 0.826 (60% noise) to 0.867 (0% noise), that of the RF-based SDP model from 0.870 (60% noise) to 0.890 (0% noise), and that of the AdaBoost-based technique from 0.660 (80% noise) to 0.783 (0% noise). Fig. 9 shows the boxplot range of FPR values for the five SDP models at different noise levels for all three datasets. The boxplot range of FPR values over the Columba data is shown in Fig. 9(a). The FPR range of the NB-based model is from 0.317 (60% noise) to 0.395 (20% noise).
The quartile range of FPR values produced by the SVM-based method over the Eclipse and Columba datasets is the largest, as shown in Fig. 9(b) and Fig. 9(a), respectively, which indicates that the model is unstable and misclassifies instances. For the SVM-based model the FPR lies from 0.383 (60% noise) to 0.611 (10% noise), for the J48-based technique from 0.168 (60% noise) to 0.216 (0% noise), for the RF-based defect prediction model from 0.146 (60% noise) to 0.209 (0% noise), and for the AdaBoost defect prediction model from 0.402 (60% noise) to 0.554 (80% noise). The FPR range for the Eclipse data is shown in Fig. 9(b) [...], and for the AdaBoost model the FPR range is from 0.217 (0%) to 0.470 (50%). The quartile range of FPR values produced by the AdaBoost-based SDP model over the Scarab data is the largest, as shown in Fig. 9(c), which indicates that the model is the least stable and misclassifies actual instances. Fig. 10 shows the boxplot range of F-measure over all three datasets produced by the five baseline methods under various noise levels. Fig. 10(a) reports the boxplot range of the f-score produced by the defect prediction methods. The f-score ranges for the Columba data produced by the NB-, SVM-, J48-, RF-, and AdaBoost-based SDP models are 0.614 (20% noise) to 0.68 (0% noise), 0.565 (40% noise) to 0.682 (0% noise), 0.821 (50% noise) to 0.851 (0% noise), 0.847 (50% noise) to 0.889 (0% noise), and 0.570 (20% noise) to 0.709 (0% noise), respectively; similarly, for the Eclipse data they are 0.616 (40% noise) to 0.974 (0% noise), 0.613 (50% noise) to 0.879 (0% noise), 0.813 (50% noise) to 0.924 (0% noise), 0.844 (60% noise) to 0.968 (0% noise), and 0.370 (60% noise) to 0.870 (0% noise), as shown in Fig. 10(b). The quartile range of f-score values produced by the AdaBoost-based defect prediction model over the Scarab and Eclipse datasets is the largest, as shown in Fig. 10(c) and Fig. 10(b), respectively. This indicates that the model is the least stable and misclassifies the actual buggy instances. The boxplot range of the f-score over the Scarab data is shown in Fig. 10(c); the ranges for the NB-, SVM-, J48-, RF-, and AdaBoost-based SDP models are 0.618 (60%) to 0.727 (0%), 0.711 (70%) to 0.852 (10%), 0.832 (70%) to 0.867 (0%), 0.870 (60%) to 0.895 (30%), and 0.621 (60%) to 0.783 (0%), respectively.

Fig. 11 shows the boxplot range of precision for all four SDP models; the values were calculated at the 0% to 80% noise levels. Fig. 11 [...] Fig. 11(b). The boxplot range of precision values produced by the AdaBoost-based model over the Eclipse data is the largest, as shown in Fig. 11 [...] Fig. 11(b). This indicates that when a model is unstable or overfitted, the range of its precision values is wide, whereas when a model is a good fit, the range is narrower. The RF-based SDP model has the narrowest range of values, so it is a highly stable and well-fitted model.

Fig. 11: Boxplot range of precision at various noise levels for all three datasets: (a) Columba, (b) Eclipse, (c) Scarab.

Fig. 12 reports the boxplot range of ROC over the datasets at various noise levels for the five baseline methods. The boxplot range for the Columba data is shown in Fig. 12 [...] Fig. 12(b). The quartile range of ROC values produced by the AdaBoost-based defect prediction model over the Scarab and Eclipse datasets is the largest, as shown in Fig. 12(c) and Fig. 12(b), respectively. This suggests that the models are the least stable and misclassify the actual buggy instances.

We have also conducted experiments over the datasets without applying sampling techniques at various noise levels. The second column of each table from Table 2 to Table 6 reports the performance metrics obtained without the sampling method at the different noise levels. It is easy to see from Table 2 [...] Table 4.
Similarly, the precision and ROC values produced by the imbalanced (non-sampled) SDP models are even lower at the different noise levels than those of the sampling-based SDP models, as shown in Table 5 and Table 6, respectively. At high noise levels (60% to 80%), however, the sampling-based SDP model misclassifies the actual class in some rare cases, which causes the worst performance. In these few cases, the non-sampling-based SDP model outperforms because the number of buggy instances exceeds the number of clean instances, which implies that the model is overfitted towards buggy instances. The sampling-based SDP models, by contrast, do not overfit at any noise level.
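The contrast between sampled and non-sampled training sets can be illustrated with a short sketch of random under-sampling. This is an assumption about the general technique, not the exact sampling procedure used in the study: the function name `random_undersample`, the toy feature values, and the choice to shrink every class to the size of the rarest one are all hypothetical.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Randomly drop rows so that every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    target = min(len(idxs) for idxs in by_class.values())
    keep = []
    for idxs in by_class.values():
        keep.extend(rng.sample(idxs, target))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: 8 buggy (1) and 2 clean (0) modules; the feature values are made up.
X = [[i, i * 2] for i in range(10)]
y = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
X_bal, y_bal = random_undersample(X, y)
print(Counter(y_bal))
```

After under-sampling, both classes have the same cardinality, so a classifier can no longer score well simply by favouring the majority (here, buggy) class.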

5.5 Comparing the performance of the proposed approach with other classical SDP models without applying the sampling method
As discussed earlier, all three datasets are imbalanced. In this section, we compare the performance of the baseline methods and of the suggested approach when both are trained on the imbalanced data, over every dataset. The TPR value of RF without applying the sampling method is higher than that of all non-sampling classical methods, as shown in Table [...] Table 6. The maximum F-measure values produced by the other traditional models without applying the sampling technique are 0.748 (J48 at 20%), 0.806 (AdaBoost at 10%), and 0.830 (AdaBoost at 0%), respectively. As reported above, in most cases the classical classifiers perform best at the 0% noise level. Because all three datasets are imbalanced, training on them leads to an overfitted model and biased results, but RF avoids overfitting [93] to some extent.

Insightful discussion
When the noise level in the datasets increases, the learning technique starts misclassifying the actual class, and the performance of the classical SDP models degrades. As the noise level increases, the number of instances carrying their actual class label decreases, and the model starts predicting the wrong class as the actual class. When the sampling method is not applied to the traditional baseline models, the classifiers start overfitting because of the imbalanced dataset, which leads to unsatisfactory results. AdaBoost is an ensemble learning (EL) method; EL methods mainly split the dataset and combine the results, and EL also helps avoid the overfitting problem [94]. In a few cases, AdaBoost outperforms the RF-based model when the sampling technique is not applied; such results are unbiased. When we apply the sampling technique, the proposed model outperforms every other SDP model at approximately every noise level. In very few cases, AdaBoost and SVM surpass it. When the noise level increases, the classical SDP models start degrading, because the instances with the actual class become fewer and the model starts predicting the notional classes. RF and J48 are tree-based models in which a leaf node represents the class. RF improves on other tree models through a small tweak that decorrelates the trees: at every split, the algorithm is not even allowed to consider a majority of the available predictors (typically, only about the square root of the full set is considered). Using the square root of the total number of predictors yields better results as the noise level increases. RF also offers efficient estimates of the test error without incurring the cost of repeated model training associated with cross-validation, so it is sufficient to avoid the notional class and predict the actual class.
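The two RF mechanisms mentioned above, the restricted set of predictors per split and the test-error estimate that needs no cross-validation (commonly obtained from out-of-bag rows), can be sketched in a few lines. This is an illustrative sketch using only the standard library, not a full random forest: the function names, the dataset shape, and the rule of rounding the square root down are assumptions.

```python
import math
import random

def split_candidates(n_features, rng):
    """Pick the random subset of features one tree split may consider (about sqrt(p))."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

def bootstrap_with_oob(n_rows, rng):
    """Draw a bootstrap sample for one tree; rows never drawn form its out-of-bag set."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    drawn = set(in_bag)
    oob = [i for i in range(n_rows) if i not in drawn]
    return in_bag, oob

rng = random.Random(42)
n_features, n_rows = 16, 1000  # hypothetical dataset shape
candidates = split_candidates(n_features, rng)
in_bag, oob = bootstrap_with_oob(n_rows, rng)
print(candidates)
print(len(oob) / n_rows)
```

With 16 predictors, each split sees only 4 of them, which is what decorrelates the trees. Each tree leaves out roughly a third of the rows (the fraction approaches 1/e for large datasets), and those unseen rows can be used to estimate test error for that tree without retraining.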
6 General discussion and threats to validity
We conducted a significance test using the Wilcoxon rank-sum test [95] to compare the noisy versus clean performance of the proposed model and the other SDP models at different noise levels for all three datasets. Table 7 lists the ROC values of the proposed model and of the other optimal baseline SDP methods at each noise level. Obuchowski et al. [96] suggested that non-parametric testing using ROC is more effective than testing on other evaluation metrics. We took two samples: the first sample (S1) lists the ROC values of the proposed model in increasing order of noise level from 0% to 80%, whereas the second sample (S2) lists the ROC of the most optimal SDP model at each noise level, in the same order. The null hypothesis H0 is that the median (difference) between the two samples is 0, and the alternative hypothesis H1 is that the median (difference) is greater than 0. The sample sizes are n1 = n2 = 27. The significance level is α = 0.005, and the critical value for a right-tailed test is z_c = 2.58, so the rejection region is R = {z : z > 2.58}. The rank sum of the first sample is 1082, which gives z = 5.873. Since z = 5.873 > z_c, the null hypothesis is rejected. Therefore, there is enough evidence to claim that the population median of differences is greater than 0 at the 0.005 significance level. A few threats to the validity of these experiments are as follows.
- We collected open-source datasets for our experiments. The types of noise present in open-source datasets and in software from large organizations may differ, because the data are acquired by differently trained employees. It would be better if private industries released their datasets so that they can be tested for noise resistance and the class imbalance problem.
- We used the public datasets as pure datasets, but some instances may not be correctly linked, and some defect items may not be adequately linked by the SCM. It is also possible that a few defects were not recorded by the bug tracking system.
- We have not considered feature noise in our study, although this noise also affects the performance of an SDP model.
- We added noise to the public datasets by randomly changing class labels, but real noise may follow a specific pattern, for example because of poorly managed data during development.
- It is challenging to perform a significance analysis across all five performance measures; this requires a multivariate non-parametric significance test.
- We used the TPR, FPR, F-measure, precision, and ROC performance measures, which have been widely used in SDP [3,97,98]; the choice of measures is another threat to the validity of our conclusions.
- We performed the Wilcoxon signed-rank test to investigate the performance differences between the approaches; it is a classical method for validating significant improvements.
- In the future, we plan to reduce these threats by performing experiments over other diverse datasets.
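The z statistic reported for the significance test in this section can be reproduced from the stated rank sum and sample sizes. The sketch below assumes the usual large-sample normal approximation for the Wilcoxon rank-sum statistic without a tie correction; the function name `rank_sum_z` is hypothetical, and the inputs are the values quoted in the text.

```python
import math

def rank_sum_z(rank_sum_1, n1, n2):
    """Normal-approximation z for the Wilcoxon rank-sum statistic (no tie correction)."""
    mean = n1 * (n1 + n2 + 1) / 2                    # expected rank sum under H0
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)    # its standard deviation under H0
    return (rank_sum_1 - mean) / std

z = rank_sum_z(1082, 27, 27)  # rank sum and sample sizes reported in the text
z_critical = 2.58             # right-tailed critical value quoted in the text
print(round(z, 3), z > z_critical)
```

Under these assumptions the expected rank sum is 742.5 with a standard deviation of about 57.8, so a rank sum of 1082 yields z of approximately 5.873, in agreement with the reported value and well inside the rejection region.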

Conclusion and future work
Noise and class imbalance are two significant challenges in SDP. We performed 864 experiments over 3 public datasets and analyzed the noise endurance of well-known SDP models. We manually added noise to the datasets, from 0 to 80%. We used 4 baseline SDP methods and trained them on these noisy datasets. We used random sampling to address the class imbalance problem. We also suggested an approach that tolerates the most noise and still outperforms the baseline methods. We also compared the performance without applying sampling methods. We found that the proposed approach surpasses the baseline techniques on noisy instances and on imbalanced data. We have also provided a few guidelines. Additionally, our main conclusions are listed below.
(i) We applied random sub-sampling as the sampling technique, which provides the most effective results compared with other sampling techniques.
(ii) Random forest outperforms the other state-of-the-art techniques. RF has a high noise tolerance (30% to 40%) compared with the other methodologies.
(iii) AdaBoost is the least capable and has a very low noise tolerance, i.e., only 10% to 20%.
(iv) J48 is approximately as effective as random forest and has a high noise tolerance, in the range of 30% to 40%.
(v) The TPR and FPR of RF have the least deviation, whereas SVM and AdaBoost vary strongly with noise. J48 and NB show an average difference after noise is added.
(vi) The f-score and ROC of RF are consistently similar in every noise scenario for all three datasets. SVM and NB have a high deviation when noise is added. J48 and AdaBoost have a moderate deviation.
(vii) Naive Bayes and SVM are moderately effective and have an intermediate level of noise tolerance: Naive Bayes tolerates up to 30% noise and SVM up to 40%.
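The noise-injection step behind these conclusions (changing class labels at rates from 0% to 80%) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' script: it assumes binary labels (1 = buggy, 0 = clean), flips a uniformly random subset of labels regardless of class, and uses a hypothetical function name `flip_labels` with made-up toy labels.

```python
import random

def flip_labels(y, noise_rate, seed=0):
    """Return a copy of the binary labels y with a fraction noise_rate of them flipped."""
    rng = random.Random(seed)
    n_flip = round(noise_rate * len(y))
    flip_idx = set(rng.sample(range(len(y)), n_flip))
    return [1 - label if i in flip_idx else label for i, label in enumerate(y)]

y_clean = [1] * 70 + [0] * 30  # toy labels: 1 = buggy, 0 = clean
changed_per_level = {}
for pct in range(0, 90, 10):   # noise levels 0%, 10%, ..., 80%
    y_noisy = flip_labels(y_clean, pct / 100)
    changed_per_level[pct] = sum(a != b for a, b in zip(y_clean, y_noisy))
print(changed_per_level)
```

Each noisy copy would then be used to train a model whose TPR, FPR, precision, F-score, and ROC are compared against the 0% baseline. A pattern-dependent noise model, as raised in the threats to validity, would replace the uniform `rng.sample` choice.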
We used public datasets; software industries should release their project data so that better data sources become available for research. Noise-handling algorithms need to be proposed, because no such algorithm is currently available for dealing with noise in defect data items.
Ensemble learning has scope in software bug tracking systems, where it can outperform state-of-the-art techniques. Deep learning-based models are not yet available because of the small number of instances in the datasets; by applying data augmentation, the training set can be enlarged so that deep learning-based architectures can be applied easily. Deep learning architectures can even be used as a feature selection method. Cross-defect bug tracking systems can also be helpful for different types of software systems, but care is needed when combining other metrics and their datasets, because this can create redundancy, which affects the performance of the learning model.