Which Process Metrics Are Significantly Important to Change of Defects in Evolving Projects: An Empirical Study

I. INTRODUCTION
During software development and maintenance, requirements changes, bug fixes, and code refactoring drive software evolution. Software evolution leads to growth in software scale, increasingly complex relationships among functional modules, and inevitable defects in software [1]-[3]. Evolving software projects accumulate a great number of versions of program code, and each new version may introduce new defects or eliminate previous ones.
In fact, the process of software evolution can be regarded as the process of continuously introducing defects and eliminating defects. Defective software can result in serious economic problems, and even endanger human life. Software defect prediction (SDP) can predict defective software modules, and allocate test resources effectively. Recently, software defect prediction has become one of the research hotspots among academic and industrial organizations [3], [4].
Software metrics are indicators or parameters that describe the characteristics of a software product [5] and are the input variables for software defect prediction. If the metrics are not properly selected, the performance of defect prediction will be poor and the prediction will be of little significance. Therefore, the problem of which software metrics should be selected for software defect prediction has become one of the research hotspots in the field. Early on, researchers mainly paid attention to code metrics [5]-[12]. Recently, researchers have put more emphasis on process metrics, such as code churn and people factors, because the code changes generated during software evolution are the main causes of defects, and process metrics reflect the software development process and the code changes [13]-[35].
The defect state of a software module is either defective or defect-free, and the change of defect state refers to whether a module introduces or eliminates defects during software evolution, that is, from the completion of the previous version to the completion of the current version. For evolving projects, a new class may introduce defects or be defect-free, and an existing class may change from defect-free to defective, change from defective to defect-free, or remain defective or defect-free. For example, if a module is defect-free in the previous version but defective in the current version, we consider it an instance of introduction of defects. Conversely, if a module is defective in the previous version and defect-free in the current version, we consider it an instance of elimination of defects. There are, of course, other states: a module may be defect-free in both versions, or defective in both versions. However, we are more concerned with ''elimination of defects'' and ''introduction of defects'', so we put all the other cases into one category, ''others''. Therefore, the change of defect state of a class can be divided into three categories: elimination of defects, introduction of defects, and others. Eliminating defects indicates that software is evolving in a good direction, whereas improper evolution introduces defects, which we do not want to observe as software evolves.
Previous research on software defect prediction tends to start from the perspective of whether there is a defect or not, that is, the classes for classification are defective and defect-free. However, for evolving projects, it is more meaningful to study whether a software module introduces or eliminates defects, that is, the change of defect state, because this helps us find the problems in the development process of evolving software. No work in the literature focuses on the change of defect state. Therefore, in this paper, we focus on the change of defect state of software modules, that is, the class of the experimental datasets is the change of defect state. Discovering the factors that influence the change of defect state in the software development process can help us understand the causes of software defects and improve the quality of subsequent software development.
This paper presents an empirical study on which process metrics are significantly important to the change of defects in evolving projects. In general, the contributions of this paper are summarized as follows: i) Existing research on software defect prediction for evolving projects has mainly focused on whether a module is defective or not. This paper focuses on whether a software module introduces or eliminates defects. To the best of our knowledge, no earlier work has explored the factors that influence the change of defect state in the software development process of evolving projects.
ii) We study which process metrics are significantly important to the change of defects in evolving projects from two aspects. First, we compare the class correlation values among five process metrics by using six class correlation measurement methods, including Pearson Correlation Coefficient, Chi-Square, ReliefF, Information Gain, Gain Ratio, and Symmetric Uncertainty. Second, we compare the classification performance values among the five process metrics in terms of four evaluation measures, including Recall, F-Measure, AUC, and MCC, by using five classification algorithms, including Naive Bayes, K-Nearest Neighbor, Logistic Regression, Multilayer Perceptron, and Support Vector Machine.
iii) Additionally, we conduct the empirical study on the project datasets extracted by Madeyski and Jureczko [19], including 18 releases of seven open source software projects and 19 releases of five industrial software projects. Very few earlier works have used these projects for the comparison of process metrics in defect prediction of evolving projects. iv) To evaluate the class correlation and classification performance values among the five process metrics, we perform an experimental analysis using different class correlation measurement methods and classification algorithms. We not only analyze the class correlation values and the classification performance values among the five process metrics, but also perform statistical analysis with the Wilcoxon matched-pair signed-rank test and Cohen's d to verify whether the experimental results are statistically significant and to calculate the effect size. The experimental results indicate that Number of Distinct Committers (NDC) plays a significantly important role in the change of defect state, including introduction of defects and elimination of defects; Number of Revisions (NR) is second, whereas Degree of Code Modification (DCM) is last. In addition, Average Number of Modified Lines (ANML) is superior to Number of Modified Lines (NML). Based on the experimental results, some suggestions for software development and software defect prediction are also discussed.
The organizational structure of this paper is as follows. The related work on process metrics and defect prediction of evolving projects is discussed in Section II. Section III describes the case study in detail. The experimental results and analysis are presented in Section IV. The threats to our study are discussed in Section V. Lastly, the summary and future work are presented in Section VI.

II. RELATED WORKS
In this section, we mainly introduce the research status of process metrics and defect prediction of evolving projects.

A. PROCESS METRICS
We can extract software metrics from databases generated during software development, such as code repositories, defect repositories, and version control systems. Based on the common characteristics of defective software modules, researchers have proposed a variety of software metrics related to software scale, software complexity, and human psychology, which can be divided into code metrics and process metrics.
Code metrics were widely used for defect prediction in early studies. These studies assumed that code size and code complexity, which are easy to extract, are strongly related to defects in software; examples include the LOC metric representing code size [5], the McCabe metrics describing the structural complexity of software code [6], the Halstead metrics representing code complexity defined by the numbers of operands and operators [7], the CK metrics of object-oriented programs [8], and the metrics of the abstract syntax tree [9]. Many scholars have criticized code metrics. Olague et al. [10] used logistic regression to implement cross-version defect prediction based on a variety of object-oriented metrics, and found that the prediction performance of object-oriented metrics is not ideal. Shepperd and Ince [11] and Radjenović et al. [13] pointed out that using LOC metrics and complexity metrics at the same time may affect the performance of defect prediction, because many complexity metrics are strongly correlated with LOC. Rahman and Devanbu [14] pointed out that code metrics are relatively stagnant, that is, code metrics may not change much after bugs are fixed. Consequently, simply using code metrics is not suitable for defect prediction in evolving projects.
Different from the information reflected by code metrics, process metrics directly reflect the software development process and the software evolution track. The code changes generated in the evolution process are the main causes of defects in evolving projects, so researchers have recently put more emphasis on process metrics. A considerable number of process metrics have been proposed, mainly including: i) metrics based on code change history, such as number of modified lines [13], [19], [20], [23], [28]-[33], and code relative change metrics [13], [20], [27], [30]; ii) metrics based on developer information, such as number of distinct committers [15], [19], [20], [23]-[26], [30], [32], [34], experience of developers [16], [34], commit activities of developers [34], project team organizational structure [17], code ownership [34], and organizational dispersion degree [18]; iii) metrics related to the development process, such as number of revisions [13]-[15], [19], [20], [22], [23], [25], [30], number of defects repaired [30], number of refactorings [20], [30], code change complexity [21], [32], and number of historical defects [19], [23]. The most widely used, classical, and defect-related process metrics are Number of Revisions (NR), Number of Distinct Committers (NDC), Number of Modified Lines (NML), and the code relative change metrics. The research status of these process metrics is as follows: • NR is a widely used process metric in software defect prediction. Schröter et al. [15] found that NR was highly relevant to the number of defects. Graves et al. [22] showed that NR was a better predictor of defects, at least better than LOC. Illes-Seifert and Paech [25] compared the correlation between process metrics and the number of defects. The experimental results showed that NR was strongly correlated with the number of defects and had better defect prediction performance.
• NDC is a controversial process metric. Some researchers thought that introducing NDC had no effect on improving the performance of defect prediction models. Weyuker et al. [24] compared the performance of prediction models with and without NDC, and found that NDC could not significantly improve the prediction performance. Other researchers claimed that NDC can improve defect prediction performance. Illes-Seifert and Paech [25] found that NDC was highly related to the number of defects, and could obtain great defect prediction performance in terms of predicting the number of defects. Matsumoto et al. [26] analyzed the correlation between a variety of developer metrics and the number of defects, and evaluated the influence of developer metrics on the defect prediction performance with the number of defects as the prediction target. The experimental results showed that introducing developer-related information could improve the defect prediction performance. Kini and Tosun [34] extracted periodic developer experience metrics at the file level and the commit level, and investigated the explanatory power of these metrics for defects. The experimental results showed that periodic developer experience metrics extracted at the file level were good predictors for defect prediction.
• NML is also a widely used process metric in software defect prediction. Previous studies have shown that NML is strongly related to defects. Nagappan and Ball [28] pointed out that NML had good defect density prediction performance. Shin et al. [29] also showed that NML had better defect tendency prediction performance. Liu et al. [31] proposed an NML-based unsupervised defect prediction model (CCUM) for effort-aware JIT defect prediction, and evaluated the prediction performance of CCUM under cross validation, time-wise cross validation, and cross-project validation. The experimental results showed that CCUM performed better than all the prior supervised and unsupervised models. Miletić et al. [33] built standard prediction models with and without cross-version code churn. The prediction models were trained on earlier releases and tested on the following ones, and the experimental results showed that the prediction model performed better when cross-version code churn was included.
• The concept of relative code change was first proposed by Munson and Elbaum [35]. Code relative change metrics refer to the degree of code change between two adjacent versions, usually represented by the ratio of NML to another software metric. Nagappan and Ball [27] conducted a study on the defect prediction performance of eight code relative change metrics. The experimental results showed that, compared with code absolute change metrics, the code relative change metrics could obtain better defect density prediction performance and defect tendency prediction performance. Some researchers compared the defect prediction performance of process metrics with that of code metrics, and found that process metrics or their combination performed better than code metrics. Radjenović et al. [13] compared the defect prediction performance of process metrics and code metrics, and the experimental results showed that the prediction performance of process metrics was better; moreover, the best process metrics were the code relative change metrics, NR, and NML. Moser et al. [20] made a similar comparison using 17 process metrics, including NR, NDC, and NML. The experimental results showed that the prediction performance of these process metrics was significantly better than that of code metrics, and similar to that of the combination of code metrics and process metrics. Graves et al. [22] found that for defect density prediction, metrics based on code change history performed better than code metrics. Madeyski and Jureczko [19] conducted an empirical study on identifying which process metrics could significantly improve the performance of defect prediction models, using NR, NDC, NML, and number of defects in previous version (NDPV) as process metrics. First, they analyzed the correlation between each process metric and the number of defects. Then, they compared the performance of models that used only code metrics with that of models that used code metrics plus one of the process metrics. The experimental results showed that introducing process metrics could significantly improve the performance of defect prediction, especially NDC. Based on this, Stanic and Afzal [23] conducted comparative experiments on software metrics. The experimental results showed that the prediction performance of process metrics was better than that of code metrics, and there was no significant difference in the prediction performance of different combinations of code metrics and process metrics. Choudhary et al. [30] proposed new change metrics and extracted change metrics from GIT repositories. The change metrics they studied included NR, NDC, NML, ANML, average number of commits made by each developer, average lines of code worked on by each developer, and so on. Machine learning algorithms were applied to the change metrics and code metrics to build fault prediction models. The experimental results demonstrated that using change metrics in conjunction with code metrics provided better performance than models with an individual metrics set, and that the change metrics had a positive impact on the prediction performance.

B. DEFECT PREDICTION OF EVOLVING PROJECTS
Defect prediction of evolving projects is also called cross-version defect prediction: a defect prediction model is built on previous versions and used to predict the defects in the current version. The classical defect prediction models include Naïve Bayes (NB) [36], K-Nearest Neighbor (KNN) [37], Logistic Regression (LR) [38], Support Vector Machine (SVM) [39], and Multilayer Perceptron (MLP) [40]. The main research on defect prediction of evolving projects can be divided into the following two types.
Some studies mainly focus on using existing and newly proposed machine learning algorithms to build cross-version defect prediction models, and on evaluating defect prediction performance under the cross-version scenario. Yang and Wen [41] compared the prediction performance of RR and LAR with other classical algorithms under the cross-version scenario in terms of predicting the number of defects. Shukla et al. [42] regarded cross-version defect prediction as a multi-objective optimization problem for the first time, considering both the prediction performance and the cost. The experimental results showed that the multi-objective optimization algorithm has broad application prospects in defect prediction of evolving projects. Martino et al. [43] used a genetic algorithm to search for the optimal parameter configuration of SVM to improve the prediction performance. The experimental results showed that the performance of the proposed algorithm was better than the comparison algorithms in both the cross validation scenario and the cross-version validation scenario. Liu et al. [44] proposed a recursive neural network prediction model that takes the sequence of all metrics over the continuous version history as input. The experimental results showed that in most cases, the proposed HVSM-based RNN model had significantly better effort-aware ranking effectiveness than baseline models. Rathore and Kumar [45] presented an approach that dynamically selects the best learning techniques to predict the number of software faults. The approach partitions the validation dataset into different module subsets and determines the best learning technique for each subset. For an unseen testing module, the approach determines the subset of the validation dataset whose modules are similar to the given testing module, and the best learning technique for that subset is taken as the best learning technique for the testing module. They built and evaluated the presented approach for intra-release prediction and inter-release prediction, demonstrating the effectiveness of the approach for cross-version software defect prediction.
Other researchers have proposed algorithms to solve the problem of inconsistent data distribution between the source dataset and the target dataset in cross-version defect prediction. Active learning has been introduced into cross-version software defect prediction for this purpose. Lu et al. [46] proposed an approach that uses uncertainty information as the strategy to iteratively select special instances of the current version, determines their classes manually, and merges them into the training set. Xu et al. [47] proposed an active learning method based on uncertainty information and information density to select special instances from the current version.
From the research status of process metrics and defect prediction of evolving projects, we can observe that no work in the literature focuses on the factors that influence the change of defect state in evolving projects. This paper presents an empirical study on which process metrics are significantly important to the change of defects in evolving projects. The process metrics used in this paper are NR, NDC, NML, DCM, and ANML, which are widely used, classical, and defect-related.

III. CASE STUDY
First, we describe the code metrics and process metrics used in this paper. Then, we provide the experimental datasets. Next, we give the data preprocessing operation. Finally, we report the experimental design.

A. SOFTWARE METRICS
Software metrics can be divided into code metrics and process metrics. The former describes the scale and complexity of software source code, and the latter describes the complexity of software development process [14], [15].
The code metrics used in this paper include: i) the code size metric (LOC), ii) the McCabe cyclomatic complexity metric, and iii) object-oriented metrics [12]. These metrics can be extracted from the source code by the Ckjm tool.
As NR, NDC, NML, DCM, and ANML are the most classical and widely used process metrics, and they are strongly related to defects, we select them as the experimental objects. The first three are the same as those in [19], and can be extracted from SVN and CVS by using the BugInfo tool. The last two were proposed in [27] and [20] respectively, and are calculated as the ratio of NML to another software metric. The following gives a detailed description of these five process metrics.
• Number of Revisions (NR) is a metric related to the development process, which refers to the total number of versions submitted by software developers to the version control system from the completion of the previous version to the completion of the current version.
• Number of Distinct Committers (NDC) is a metric based on developer information, which refers to the total number of developers participating in the development from the completion of the previous version to the completion of the current version.
• Number of Modified Lines (NML) is a metric based on code change history, which refers to the total number of added, deleted, and modified code lines submitted from the completion of the previous version to the completion of the current version.
• Degree of Code Modification (DCM), obtained by dividing the number of modified lines by the total number of lines, i.e., DCM = NML / LOC, is one of the code relative change metrics. DCM represents the degree of code modification, that is, the average number of times each line of code has been modified.
• Average Number of Modified Lines (ANML), obtained by dividing the number of modified lines by the number of revisions, i.e., ANML = NML / NR, is one of the code relative change metrics. ANML represents the average size of each change, that is, the number of modified lines involved in each revision. A sketch of how all five metrics can be computed is given after this list.
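To make these definitions concrete, the following is a minimal sketch, not the authors' extraction tooling (they use BugInfo over SVN/CVS), of how the five metrics could be computed for one class file from the commit records between two releases; the Commit fields are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Commit:
    author: str          # committer identity (assumed field)
    lines_changed: int   # added + deleted + modified lines touching this file

def process_metrics(commits: list[Commit], loc: int) -> dict:
    """Compute NR, NDC, NML, DCM, and ANML for a single class file,
    given the commits between the previous and the current release."""
    nr = len(commits)                              # Number of Revisions
    ndc = len({c.author for c in commits})         # Number of Distinct Committers
    nml = sum(c.lines_changed for c in commits)    # Number of Modified Lines
    dcm = nml / loc if loc else 0.0                # DCM = NML / LOC
    anml = nml / nr if nr else 0.0                 # ANML = NML / NR
    return {"NR": nr, "NDC": ndc, "NML": nml, "DCM": dcm, "ANML": anml}
```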

B. EXPERIMENTAL DATASETS
NR, NDC, NML, and the 20 code metrics of this study are downloaded from the database created by Madeyski and Jureczko [19], whereas DCM and ANML are calculated from the existing data. We determine the class of each instance based on whether the previous version has defects and whether the current version has defects, and the classes of these datasets are ''introduction of defects'', ''elimination of defects'', and ''others''. So, each instance of these datasets is a software module in the current version, consisting of 20 code metrics, five process metrics, and a class, that is, the change of defect state from the previous version to the current version. Project versions meeting any of the following three conditions are removed from the dataset created by Madeyski and Jureczko [19]. i) The first version of each project has no evolution history and therefore no values for the five process metrics, so we do not list those project versions, such as ant-1.3 and camel-1.0. ii) For some projects, it is impossible to collect all five process metrics, including poi, pbeans, ivy, log4j, velocity, prop-1-192, prop-2-225, prop-3-285, prop-4-347, prop-5-185, and prop-6; we do not use them as experimental objects either. iii) Some project versions with very high class imbalance rates have very few instances of introduction of defects or elimination of defects, including camel-1.2 (which has only two instances of elimination of defects, 205 instances of introduction of defects, and 558 instances of others), xalan-2.7, and xerces-1.4.4. Such a high class imbalance rate and so few minority-class instances would affect the experimental results, so we do not use these project versions as experimental datasets. The remaining projects, seven open source software projects with 18 versions and five commercial projects with 19 versions, are the experimental subjects. These software projects are all Java projects and come from different application fields. Table 1 lists the detailed information of these experimental datasets. The first three columns are the project name, the version number, and the number of all classes in the version respectively. The fourth and sixth columns are the numbers of instances going from defect-free to defective (introduction of defects) and from defective to defect-free (elimination of defects) respectively. The fifth and seventh columns are the corresponding proportions of these instances.
From Table 1, we can observe that the process of software evolution is a process of continuously introducing and eliminating defects. In some versions, many classes eliminate defects and few introduce defects, such as prop-1-44 and prop-2-265; more eliminations than introductions indicate that the software is evolving in a good direction. In other versions, many classes introduce defects and few eliminate defects, such as ant-1.4 and ant-1.6, which shows that these projects evolved improperly. These 37 datasets share the same 20 code metrics and five process metrics, but have different proportions of introduction of defects and elimination of defects. So, these datasets can be used as the experimental datasets to evaluate the influence of the five process metrics on the change of defect state in evolving projects.
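The labeling rule above reduces to a small decision function. The sketch below follows the rule as stated; treating a class that is new in the current version as previously defect-free is our assumption, since the paper does not spell out that case.

```python
def change_of_defect_state(prev_defective: bool, curr_defective: bool) -> str:
    """Label one module by its change of defect state between two releases.
    For a class that is new in the current version, prev_defective is set
    to False here (an assumption, not stated explicitly in the paper)."""
    if not prev_defective and curr_defective:
        return "introduction of defects"   # defect-free -> defective
    if prev_defective and not curr_defective:
        return "elimination of defects"    # defective -> defect-free
    return "others"                        # defect state unchanged
```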

C. DATA PREPROCESSING
Proper data preprocessing can improve the quality of the dataset, and is beneficial to experimental research, so we conduct data preprocessing operations, including data normalization processing and class imbalance processing.
Different metrics have different ranges, which may affect the relationship between each metric and the change of defect state. To alleviate the negative impact of different ranges on the evaluation results, we use the Maximum-Minimum method [48] to normalize the values of all metrics to [0, 1]. The equation of data normalization processing is as follows:

M'_ij = (M_ij − Min(M_i)) / (Max(M_i) − Min(M_i))   (1)

where M'_ij represents the value of the i-th metric of the j-th instance after normalization, M_ij represents the value of the i-th metric of the j-th instance before normalization, Max(M_i) represents the maximum value of the i-th metric over all instances, and Min(M_i) represents the minimum value of the i-th metric over all instances.
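As an illustration of Eq. (1), a minimal NumPy sketch of column-wise Maximum-Minimum normalization; the guard against constant columns is our addition.

```python
import numpy as np

def min_max_normalize(X: np.ndarray) -> np.ndarray:
    """Scale each metric (column) of X to [0, 1] as in Eq. (1)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # avoid division by zero
    return (X - x_min) / span
```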
The experimental datasets suffer from varying degrees of class imbalance, which can influence the experimental results. We use SMOTE (Synthetic Minority Over-sampling TEchnique) [49] to handle the class imbalance problem of these datasets. SMOTE generates synthetic numeric values for the minority classes only, to balance the number of instances of the minority classes with that of the majority class. Generating synthetic values means generating new values from the existing instances in the dataset. For example, for each minority class instance X, SMOTE randomly selects an instance X_k from the nearest neighbors of X, and then randomly selects a point on the line between X and X_k as the newly synthesized minority class instance. More details on SMOTE can be found in the work of Chawla et al. [49]. The synthetic value X' is given by (2):

X' = X + δ × (X_k − X)   (2)

where δ is a random number between 0 and 1.
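A minimal sketch of the interpolation step in Eq. (2); neighbor search and the oversampling loop are omitted, and in practice a library implementation such as imbalanced-learn's SMOTE would typically be used.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_synthetic(x: np.ndarray, x_k: np.ndarray) -> np.ndarray:
    """Create one synthetic minority instance on the segment between a
    minority instance x and one of its nearest minority neighbors x_k,
    following Eq. (2): X' = X + delta * (X_k - X)."""
    delta = rng.random()          # random number in [0, 1)
    return x + delta * (x_k - x)
```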

D. EXPERIMENTAL DESIGN
The aim of this paper is to compare the importance of each single process metric to the change of defect state, including introduction of defects and elimination of defects, in evolving projects. Figure 1 provides an overview of our empirical study. According to this overview, to study which process metrics are more important to the change of defect state, we need to analyze the correlation between each process metric and the change of defect state, and analyze the classification performance of each process metric for the change of defect state. The following research questions are addressed in this study.
RQ1: What is the correlation between each process metric and the introduction or elimination of software defects?
RQ2: What is the ability of each process metric to classify introduction of defects and elimination of defects?
For RQ1, there are many methods to calculate the class correlation, which can be divided into three categories: i) methods based on statistical theory, such as Pearson Correlation Coefficient and Chi-Square; ii) methods based on instances, such as Relief and ReliefF; iii) methods based on information entropy theory, such as Information Gain, Gain Ratio, and Symmetric Uncertainty. We use six classical class correlation measurement methods, namely Pearson Correlation Coefficient, Chi-Square, ReliefF, Information Gain, Gain Ratio, and Symmetric Uncertainty, which cover all three categories, to conduct the class correlation analysis. This experiment is implemented with Weka, a specialized tool for machine learning and data mining, to ensure that these class correlation measurement methods are implemented correctly.
Pearson Correlation Coefficient is a method to evaluate the importance of a metric to the classification by measuring the linear correlation between each metric and the classes. The Pearson Correlation Coefficient of variable X and variable Y is given by (3):

r(X, Y) = cov(X, Y) / (σ_X · σ_Y)   (3)

where cov(X, Y) is the covariance of X and Y, and σ_X and σ_Y are the standard deviations of X and Y.
Chi-Square is a nonparametric statistic used to verify whether a metric is related to the class distribution. The null hypothesis is that they are not related. The plausibility of the null hypothesis is measured by calculating the distance between the observed values and the expected values under the null hypothesis. The greater the distance, the less plausible the null hypothesis, and the more likely it is that the metric distribution is related to the class distribution. Chi-Square is given by (4):

χ² = Σ_{i=1..r} Σ_{j=1..n_c} (O_ij − E_ij)² / E_ij   (4)
where r represents the number of distinct values of a metric, n_c is the number of classes, and O_ij and E_ij represent the observed number and the expected number of instances whose metric value is i in class j. Unlike the first two methods, ReliefF does not directly calculate the correlation between the metric and the classes, but gives each metric an importance coefficient which measures the ability of the metric to distinguish instances of different classes, and then updates the coefficient iteratively. The coefficient is given by (5):

W(A) = W(A) − Σ_{j=1..k} diff(A, S, H_j) / (m·k) + Σ_{C≠class(S)} [ P(C) / (1 − P(class(S))) · Σ_{j=1..k} diff(A, S, M_j(C)) ] / (m·k)   (5)

where diff(A, R_1, R_2) is the distance between instances R_1 and R_2 on the metric A, H_j is a nearest neighbor instance with the same class as the randomly selected instance S, M_j(C) is a nearest neighbor instance of a different class C, m is the number of times S is randomly selected, and k is the number of nearest neighbor instances of S selected each time.
Information Gain measures the amount of information about the classes that a metric brings for classification, which is calculated by subtracting the information entropy of dataset S conditioned on metric A from the information entropy of the original dataset S. The value of Information Gain is given by (6):

IG(S|A) = H(S) − H(S|A)   (6)
where H is the information entropy. Information Gain Ratio introduces split information on the basis of Information Gain, which offsets the impact of the number of metric values on the amount of information that a metric brings for classification. The value of Information Gain Ratio is given by (7):

GR(A) = IG(S|A) / SplitE(A)   (7)

where SplitE(A) is the split information of metric A.
Symmetric Uncertainty is a nonlinear correlation measurement method. The Symmetric Uncertainty of the metric A and the classes C is given by (8):

SU(A, C) = 2 · IG(S|A) / (H(A) + H(C))   (8)

where H(A) and H(C) are the information entropies of the metric A and of the classes C respectively.
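To make equations (6) to (8) concrete, a minimal sketch of Information Gain and Symmetric Uncertainty for a discretized metric (discretization of continuous metric values is assumed to have been done beforehand):

```python
import numpy as np

def entropy(values: np.ndarray) -> float:
    """Shannon entropy H of a discrete variable."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature: np.ndarray, classes: np.ndarray) -> float:
    """IG(S|A) = H(S) - H(S|A) for a discretized metric A, Eq. (6)."""
    h_conditional = 0.0
    for v in np.unique(feature):
        mask = feature == v
        h_conditional += mask.mean() * entropy(classes[mask])
    return entropy(classes) - h_conditional

def symmetric_uncertainty(feature: np.ndarray, classes: np.ndarray) -> float:
    """SU(A, C) = 2 * IG(S|A) / (H(A) + H(C)), Eq. (8)."""
    return 2.0 * information_gain(feature, classes) / (
        entropy(feature) + entropy(classes))
```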
The process of class correlation analysis of process metrics is shown in Procedure 1. First, two data preprocessing operations are conducted on 37 datasets, including data normalization processing and class imbalance processing (lines 2 to 3). And then, the correlation between each process metric and the change of defect state is evaluated by each class correlation measurement method (line 5). Last, the ranking of process metrics is obtained according to their class correlation (line 6).
For RQ2, we need to construct classification models. Lessmann [4] showed that most classification algorithms have similar performance with no significant difference. Therefore, we select five classical and effective classification algorithms, Naive Bayes (NB) [36], K-Nearest Neighbor (KNN) [37], Logistic Regression (LR) [38], Support Vector Machine (SVM) [39], and Multilayer Perceptron (MLP) [40], to construct the classification models of introduction of defects and elimination of defects, and to verify the consistency of the five classification algorithms. Similarly, this experiment is implemented with Weka. For KNN, K is set to 5, and for the other algorithms, we use the default parameters of Weka.
In this paper, we use the combination of all code metrics and each process metric to build the classification models, with 10 times 10-fold cross validation. First, a dataset is divided into ten equal parts. Then, nine of them are selected as the training set and the remaining one as the testing set, repeating 10 times so that each part is tested. Next, the average value over the 10 folds is taken as the final performance. Finally, the above steps are repeated 10 times to alleviate the effect of randomness. We select Recall, F-Measure, AUC, and MCC as performance evaluation measures to compare the performance of these classification models. Because we focus on introduction of defects and elimination of defects, eight evaluation measures are used in total: Recall-Introducedefects, Recall-Removedefects, F-Measure-Introducedefects, F-Measure-Removedefects, AUC-Introducedefects, AUC-Removedefects, MCC-Introducedefects, and MCC-Removedefects. These performance evaluation measures can be calculated from the confusion matrix, as shown in Table 2. Recall is the ratio of the number of instances correctly predicted as class k to the total number of instances of class k, that is, the true positive rate:

Recall = TP / (TP + FN)

False positive rate (pf) is the ratio of the number of instances incorrectly predicted as class k to the total number of instances which are not of class k:

pf = FP / (FP + TN)

Precision is the ratio of the number of instances correctly predicted as class k to the total number of instances predicted as class k:

Precision = TP / (TP + FP)

F-Measure is the harmonic mean of Recall and Precision:

F-Measure = (2 × Precision × Recall) / (Precision + Recall)

Area Under the Curve (AUC) is the area under the Receiver Operating Characteristic (ROC) curve, which describes the relationship between the true positive rate and the false positive rate: the abscissa represents the false positive rate, the ordinate represents the true positive rate, and each point on the curve corresponds to a classification threshold. AUC is not affected by class imbalance and is independent of the prediction threshold.
Matthews Correlation Coefficient (MCC) represents the correlation coefficient between the actual classification and the predicted classification. MCC is calculated from the four values in the confusion matrix:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

The values of Recall, F-Measure, and AUC range from 0 to 1, and MCC ranges from −1 to 1. The higher the value, the better the performance of the classification model.
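A compact sketch of these one-vs-rest measures computed from the confusion-matrix counts of one class (the zero-division guards are our addition):

```python
import math

def one_vs_rest_measures(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Recall, F-Measure, and MCC for one class (e.g. 'introduction of
    defects') treated as the positive class, per the equations above."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Recall": recall, "F-Measure": f_measure, "MCC": mcc}
```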
The process of classification performance analysis of process metrics is shown in Procedure 2. First, the two data preprocessing operations, data normalization and class imbalance processing, are performed on the 37 datasets (lines 2 to 3). Then, the process metric to be evaluated is retained and the other process metrics are removed, that is, the combination of all code metrics and one process metric is used to build each classification model (line 5). Last, we conduct 10 times 10-fold cross validation to build each classification model and evaluate its performance (lines 6 to 20). First, the order of the instances is shuffled, and the dataset is divided into ten equal parts (lines 7 to 8). Second, we use nine parts as training data and the remaining one as testing data in turn (lines 9 to 11). Third, we use the five classification algorithms to train each classifier on the training set (lines 12 to 13). Fourth, we use the eight performance evaluation measures to evaluate the classification performance of each classifier (line 14). Fifth, we take the average value over the ten folds as the classification performance of each process metric (lines 17 to 19). Last, we repeat the above process ten times (lines 6 to 20).
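The paper runs Procedure 2 in Weka; as an illustration only, the following condensed sketch uses scikit-learn stand-ins for the five Weka classifiers (K = 5 for KNN as in the paper, defaults elsewhere), with macro-averaged F-Measure standing in for the eight evaluation measures.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score

CLASSIFIERS = {
    "NB": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),  # K = 5 as in the paper
    "LR": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(max_iter=1000),
    "SVM": SVC(),
}

def ten_by_ten_cv(X: np.ndarray, y: np.ndarray, clf) -> float:
    """Average score over 10 repetitions of 10-fold cross validation."""
    scores = []
    for repeat in range(10):                      # repeat to reduce randomness
        skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=repeat)
        for train_idx, test_idx in skf.split(X, y):
            clf.fit(X[train_idx], y[train_idx])
            pred = clf.predict(X[test_idx])
            scores.append(f1_score(y[test_idx], pred, average="macro"))
    return float(np.mean(scores))
```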
As well as comparing the class correlation values and classification performance values, we conduct the Wilcoxon matched-pair signed-rank test [50] at the 95% confidence level, a nonparametric test for two or more related samples, to test whether the differences in class correlation values among the five process metrics are significant and whether the differences in their classification performance are significant. This statistical method has been widely used in SDP [51], [52]. The null hypothesis is that there is no significant difference among the five process metrics at the 95% confidence level. If the P value is below 0.05, the null hypothesis is rejected, that is, there is a significant difference among the five process metrics.
Even if the significance test shows that the difference reaches the significance level, it lacks practical value if the effect size is too small. So, in order to further illustrate the degree of difference among the five process metrics in terms of class correlation and classification performance, we also apply Cohen's d [53] to calculate the difference between NDC and the other process metrics. Cohen's d is an effect size that is not affected by the number of samples. It is defined as

d = (µ_1 − µ_2) / √((σ_1² + σ_2²) / 2)

where µ_1 and µ_2 represent the average value of each sample, and σ_1 and σ_2 represent the standard deviations.
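A minimal sketch of both statistical steps for one pair of metrics, assuming paired score vectors over the 37 datasets:

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_to_ndc(ndc_scores: np.ndarray, other_scores: np.ndarray):
    """Wilcoxon matched-pair signed-rank test plus Cohen's d between the
    NDC scores and another metric's scores over the same datasets."""
    _, p_value = wilcoxon(ndc_scores, other_scores)  # paired, nonparametric
    pooled_sd = np.sqrt((np.var(ndc_scores, ddof=1)
                         + np.var(other_scores, ddof=1)) / 2)
    cohens_d = (ndc_scores.mean() - other_scores.mean()) / pooled_sd
    return p_value, cohens_d  # significant at the 95% level if p_value < 0.05
```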

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we not only present, in figures, the correlation between each process metric and the change of defect state and the classification performance of each process metric, but also perform statistical analysis on the class correlation values and the classification performance values of each process metric, to verify whether the experimental results are of practical value.

A. CORRELATION BETWEEN EACH PROCESS METRIC AND CLASSES
In this experiment, Pearson Correlation Coefficient, Chi-Square, ReliefF, Information Gain, Gain Ratio, and Symmetric Uncertainty are used to measure the correlation between each process metric and the change of defect state. Figure 2 shows the class correlation of the five process metrics obtained by the six class correlation measurement methods on the seven open source software projects with 18 versions and the five commercial projects with 19 versions respectively. The abscissa represents the projects in Table 1, and the ordinate represents the class correlation values. According to Figure 2, we can observe that the NDC process metric obtains the highest class correlation values among the five process metrics for all class correlation measurement methods and almost all projects, followed by NR; ANML has higher class correlation values than NML, whereas DCM obtains the lowest class correlation among the five process metrics.
To further explore whether there are significant differences between NDC and the other process metrics in class correlation, the Wilcoxon matched-pair signed-rank test at the 95% confidence level is applied. If the P value is below 0.05, the null hypothesis is rejected, that is, NDC is significantly better than the other process metric. Table 3 shows the significance test results. In Table 3, P values below 0.05 are shown in bold, and '' '' indicates that NDC is not superior to the other process metric.
From Table 3, we can observe that there are significant differences between NDC and the other process metrics for almost all class correlation measurement methods.
In addition, to give a clearer comparison between NDC and the other process metrics on class correlation, we compute the effect size between NDC and each other process metric over all 37 datasets according to Cohen's d; the results are shown in Table 4. From Table 4, we can observe that, compared with the other process metrics, the class correlation of NDC produces a certain effect size, even a large one, for almost all class correlation measurement methods.
In short, we conclude that for these six class correlation measurement methods, the correlation between NDC and the change of defect state is significantly higher than that between the other process metrics and the change of defect state, with a certain effect size; NR is second, whereas DCM is last. In addition, ANML is superior to NML.

B. CLASSIFICATION PERFORMANCE OF PROCESS METRICS
This experiment uses NB, KNN, LR, MLP, and SVM as classification algorithms, with all code metrics plus each process metric as input variables, to build classification models, and then compares their performance to study which process metrics have better classification performance for the change of defect state. Boxplots for the five classification algorithms are drawn to compare the performance of all process metrics on all projects clearly and intuitively, as shown in Fig. 3 (a), (b), (c), (d), and (e). The abscissa represents the five process metrics, and the ordinate represents the classification performance values. The red line represents the median of all performance measure values of each metric, and the blue square represents the average of all performance measure values of each metric.
From Figure 3, we can draw the following conclusions. i) In terms of Recall, the performance values of the five process metrics are similar, but according to the average values, NDC is the best.
ii) In terms of F-Measure, the classification performance of NDC is the best, followed by NR, and ANML ranks third.
iii) In terms of AUC, the performance values of the five process metrics are all superior to the performance of random classification, which is 0.5. Furthermore, the median and average AUC values of all process metrics are above 0.7, and some even reach 0.9, which indicates that the classification models are acceptable. Similarly, the classification performance of NDC is the best, followed by NR and ANML. iv) In terms of MCC, the performance of NDC is the best, followed by NR and ANML. The values of the process metrics are all above 0, indicating that the predicted classes are positively correlated with the actual classes. Furthermore, the average performance values of the process metrics are all above 0.2, which indicates that their classification performance is good.
v) The classification performance of all process metrics for elimination of defects is better than that for introduction of defects, especially of NDC and NR.
vi) The advantage of NDC in elimination of defects is more obvious than that in introduction of defects.
So, we can conclude that the classification performance of all process metrics for elimination of defects is better than that for introduction of defects. Among the five process metrics, NDC obtains the best classification performance for the change of defect state on all evaluation measures, followed by NR; ANML ranks third and is better than NML. In addition, the advantage of NDC in elimination of defects is more obvious than that in introduction of defects.
To determine the statistical significance between the classification performance of NDC and that of other process metrics, the Wilcoxon matched-pair signed-rank test with 95% confidence interval is applied. Table 5 shows the significance test results of eight evaluation measures between NDC and others on five classification algorithms.
From Table 5, we can observe that there is a significant difference between the classification performance of NDC and that of other process metrics in most evaluation measures and all five classification algorithms, especially for elimination of defects.
In addition, to give a clearer comparison between NDC and the other process metrics on classification performance, we compute the effect size between NDC and each other process metric over all 37 datasets according to Cohen's d; the results are shown in Table 6.
From Table 6, we can observe that, compared with the other process metrics except NR, NDC achieves at least a small, and often a large, effect size on the classification performance in almost all classification algorithms and evaluation measures. At the same time, NDC has a larger effect on elimination of defects than on introduction of defects.
In conclusion, the classification performance of NDC is significantly better than that of the other process metrics in most evaluation measures, with a certain effect size. NDC is more important for elimination of defects than for introduction of defects. Therefore, we suggest that when the number of defects is large, the number of developers should be reduced, whereas when the number of defects is small, the number of developers can be increased to improve development efficiency. Moreover, the classification performance of ANML is better than that of NML. Therefore, we suggest that when predicting software defects, NDC and the code relative change metrics should be extracted, not only NML.

C. DISCUSSION ABOUT DIFFERENT CLASS IMBALANCE PROCESSING
In this study, the experimental datasets suffer from varying degrees of class imbalance, so we use SMOTE to handle the class imbalance problem before conducting the experiments. There are plenty of class imbalance handling methods, such as under-sampling and over-sampling. To rule out the influence of different class imbalance handling methods on the classification performance, we also compare the classification performance for the change of defect state among the five process metrics under under-sampling and over-sampling respectively. Figures 4 and 5 show the boxplots of the F-Measure and AUC values of the five process metrics obtained by MLP and SVM across the 37 versions of 12 projects with over-sampling and under-sampling respectively.
From Figures 4 and 5, we can observe that the median and average AUC values are all above 0.7, so these classification models are acceptable. In addition, the classification performance of all process metrics for elimination of defects is better than that for introduction of defects. Moreover, NDC obtains the best classification performance for the change of defect state among the five process metrics, followed by NR; ANML ranks third and is better than NML. The advantage of NDC in elimination of defects is more obvious than that in introduction of defects. We also conducted experiments with the other evaluation measures and classifiers and obtained the same results, which are not fully listed for space reasons. The conclusion is consistent with the datasets processed with SMOTE.
We conduct the Wilcoxon matched-pair signed-rank test at a confidence level of 95%. Tables 7 and 8 show the significance test results for over-sampling and under-sampling respectively. We also compare the effect size over all 37 datasets with over-sampling and under-sampling between NDC and the other process metrics according to Cohen's d; the results are shown in Tables 9 and 10 respectively.
From Tables 7 to 10, we can observe that NDC generally achieves the best classification results. There is a significant difference between the classification performance of NDC and that of the other process metrics in most evaluation measures with over-sampling and under-sampling, and NDC obtains at least a small, and sometimes a large, effect size on the classification performance. We also conducted the Wilcoxon matched-pair signed-rank test and Cohen's d on the other evaluation measures and classifiers and obtained the same results, which are not fully listed for space reasons. So, we can conclude that the experimental results with other class imbalance handling methods are consistent with those of SMOTE, and that the choice of class imbalance handling method does not affect the experimental conclusions.

V. THREATS TO VALIDITY
In this section, we describe the threats to validity of our study in construct validity, internal validity, external validity, and conclusion validity.

A. THREATS TO CONSTRUCT VALIDITY
In this paper, we use Recall, F-Measure, AUC, and MCC to present the classification performance. These results can be further refined by using other performance evaluation measures.
Also, we use the jar package provided by the Weka tool to implement our study. Weka is a widely used tool for machine learning and data mining, and we believe it is reliable.

B. THREATS TO INTERNAL VALIDITY
In this paper, the most popular and widely used algorithms were considered. We selected six class correlation measurement methods and five classical classification algorithms to conduct the experiments. In fact, there are other class correlation measurement methods and classification algorithms, and the comparison could be done among a greater number of algorithms. We will include more techniques in the comparative analysis to provide new results in the future.
Also, we focus on five classical, widely used, and defect-related process metrics. There are other process metrics, such as the age of class files and the experience of developers. Comparisons with other process metrics could yield more useful conclusions.

C. THREATS TO EXTERNAL VALIDITY
The seven open source projects and five commercial projects used in our study are all Java projects, and they differ in application fields, scales, and the proportions of introduction of defects and elimination of defects. Consequently, our empirical study is universal, and the conclusions are general. These results can be further refined by using a greater number of datasets; an increased number of datasets would strengthen the experimental results.

D. THREATS TO CONCLUSION VALIDITY
In the second experiment, in order to eliminate the effect of randomly dividing the instances, we performed 10 times 10-fold cross validation. In addition, Wilcoxon matched-pair signed-rank test and Cohen's d are used to test whether the experimental result among five process metrics is statistically significant and calculate the effect size.

VI. CONCLUSION
Software defect prediction is an application of machine learning in software engineering. In this paper, we focus on the change of defect state of software modules, including introduction of defects and elimination of defects. We investigate which process metrics are significantly important to the change of defects in evolving projects by conducting an empirical study on 18 release versions of seven open source projects and 19 release versions of five commercial projects. In detail, we compare the class correlation values among five process metrics by using six class correlation measurement methods, and the classification performance values among the five process metrics in terms of four evaluation measures by using five classification algorithms. We also perform statistical analysis with the Wilcoxon matched-pair signed-rank test and Cohen's d to verify whether the experimental results are statistically significant and to calculate the effect size. The experimental results indicate that among these five process metrics, Number of Distinct Committers (NDC) plays a significantly important role in the change of defect state, especially for elimination of defects; Number of Revisions (NR) is second, whereas Degree of Code Modification (DCM) is last. In addition, Average Number of Modified Lines (ANML) is superior to Number of Modified Lines (NML). Based on the experimental results, some suggestions for software development and software defect prediction are also discussed. We suggest that when the number of defects is large, software development managers should reduce the number of developers, whereas when the number of defects is small, the number of developers can be increased to improve development efficiency. Moreover, ANML is more related to the introduction and elimination of defects than NML, and its classification performance is better. Therefore, we suggest that when predicting software defects, the code relative change metrics as well as NDC should be extracted, not only NML.
In the future, comparison can be done among other process metrics by using more datasets, class correlation measurement methods, classification algorithms, and performance evaluation measures to get more useful conclusions of software development and software testing.