Feature Subset Selection and Instance Filtering for Cross-project Defect Prediction: Classification and Ranking

Defect prediction models can be a good tool for organizing a project's test resources. The models can be constructed with two main goals: 1) to classify the software parts as defective or not; or 2) to rank the most defective parts in decreasing order. However, not all companies maintain an appropriate set of historical defect data. In this case, a company can build an appropriate dataset from known external projects, an approach called Cross-project Defect Prediction (CPDP). CPDP models, however, present low prediction performances due to the heterogeneity of the data. Recently, Instance Filtering methods were proposed to reduce this heterogeneity by selecting the most similar instances from the training dataset. Originally, the similarity is calculated based on all the available dataset features (or independent variables). We propose that using only the most relevant features in the similarity calculation can result in more accurate filtered datasets and better prediction performances. In this study we extend our previous work. We analyse both prediction goals: Classification and Ranking. We present an empirical evaluation of 41 different methods obtained by associating Instance Filtering methods with Feature Selection methods. We used 36 versions of 11 open source projects in our experiments. The results show similar evidence for both prediction goals. First, the defect prediction performance of CPDP models can be improved by associating Feature Selection and Instance Filtering. Second, no evaluated method presented a generally better performance. Indeed, the most appropriate method can vary according to the characteristics of the project being predicted.


Introduction
Software testing activities are crucial for quality assurance in the software development process. Applying these activities, however, can be expensive and the available resources limited. Defect prediction models aim at predicting the likely defective parts of the software. Thus, the cost and efficiency of testing can be improved by prioritizing the most critical software parts [1].
In an ideal defect prediction it would be possible to predict the exact number of defects for each software part. However, this goal is hard or even impossible to achieve due to the lack of good quality data in practice [2]. Thus, an alternative is to simplify the prediction goal and make it feasible for practical application. In the literature, two main types of software defect prediction have been studied. In the first, the goal is to classify whether a software part is defective or not [3,4,5]. In practice, this kind of prediction allows test resources to be distributed more efficiently, although it does not differentiate the level of importance among the defective parts [6]. In the second type, the goal is to predict the instances most likely to contain the largest number of defects and rank them in decreasing order [7,6,2]. In a context where resources are limited, this kind of prediction allows the software quality assurance team to be directed to the most defective parts first [8]. Both prediction goals have their own singularities and practical importance. An analysis involving the two prediction goals provides a comprehensive perspective on the current state of the art of software defect prediction.
In another ideal context, it would be possible to predict defects in a project from a knowledge base constructed from known external projects. In fact, this kind of prediction has already been studied in the literature, in an approach called Cross-project Defect Prediction (CPDP). On the one hand, this approach addresses the lack of defect data commonly absent in software companies [9]. On the other, it introduces heterogeneity in the data, which can compromise the defect prediction performance, as discussed in [10] and [11].
Alternative methods were proposed in order to improve the performance of CPDP models. Among these methods, we can highlight the filtering methods proposed by Turhan et al. [12] and Peters et al. [13]. These methods aim at building an accurate filtered dataset by selecting the most similar instances from the CPDP training dataset. Both methods use the Euclidean distance to measure the similarity between instances. This similarity measure, though, originally considers the entire set of features of the dataset.
We propose that using only the most relevant features when measuring the similarity between instances can improve the performance of the mentioned filtering methods. To select the most relevant features of a dataset we turn to Feature Selection (FS) methods, widely investigated and adopted in the data mining literature [14]. We evaluate four distinct FS methods: Information Gain [15], Relief [16], CFS [14], and Sparcl [17]. We also compare the use of two specific subsets of features: code metrics [18] and network metrics [4]. We refer to the evaluated methods as Instance Filtering methods based on Feature subset Selection (IFFS).
In our previous study [19], we presented an empirical evaluation of the mentioned IFFS methods in the Classification context. In this paper we extend this experimentation to both prediction contexts: Classification and Ranking. The two experimentation contexts are similar in structure but differ with respect to the datasets, the prediction model, and the performance measures, as discussed in the text.
We aim to answer the following research question for both prediction contexts:
• RQ1: Can IFFS methods lead to better performances on CPDP models?
The results indicate a positive answer in both contexts, Classification and Ranking. For all evaluated projects, at least one IFFS method presented better performance than the absence of filtering methods, with statistical significance. In the Classification context, the IFFS methods improved the percentage of models considered successful for practical use. In the Ranking context, the IFFS methods presented significantly better performance than the absence of filtering methods and even better performances compared with a random ordering.
The results show similarities in both prediction contexts. They do not reveal a unique method with generally good performance. Instead, the most appropriate method depends on the project characteristics.
This paper is organized as follows. Section 2 provides the necessary background of the related subjects as well as the related works proposed in the literature. Section 3 shows the experimental setup and methodology. Section 4 presents the obtained results and the answers for the research questions. Section 5 presents the threats to validity. Finally, Section 6 concludes the paper and sketches future work.

Software Defect Prediction
Constructing a good defect prediction model in both contexts (Classification and Ranking) encompasses two main issues: 1) building an accurate training dataset and 2) applying an appropriate machine learning algorithm. The training dataset consists of a table of elements (software parts or instances) associated with their respective independent variables (software characteristics or attributes) and the dependent variable (target attribute). In classification problems, the dependent variable corresponds to a binary class: defective or not-defective [4]. In ranking problems, the dependent variable represents the number of defects associated with each instance [7].
In both prediction contexts, several machine learning algorithms have been studied. In [20], the authors compared the prediction performance of 22 classifiers over 10 public domain datasets from the NASA Metrics Data repository. Their results indicate a superior performance of Random Forest in relation to other algorithms. In addition, they found no statistically significant difference among the top 17 classifiers.
Weyuker et al. [6] compared four regression algorithms in the ranking context over three large industrial software systems. Their results also indicate the good performance of Random Forest in relation to other algorithms. Usually, the ranking task is conducted in two steps [21,7,6]: first, a regression model is built and applied to predict the number of defects for each instance; then, the instances are ordered based on their predicted number of defects. Yang et al. [2] propose a learning-to-rank approach in which the model is constructed to directly optimize the ranking performance, instead of the two-step approach. The proposed model leads to good performances in datasets with few attributes. However, in datasets with a large number of attributes, it performed worse than Random Forest.
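The two-step ranking procedure above can be sketched in a few lines. This is an illustrative Python fragment (the study itself used R): the predicted counts stand in for the output of a hypothetical regression model such as Random Forest regression, and the function only performs the second step, the ordering.

```python
# Step 2 of the two-step ranking: order instances by a regressor's
# predicted defect counts, most defective first. `predicted` is a stand-in
# for the output of step 1 (e.g. a Random Forest regression model).
def rank_by_predicted_defects(instance_ids, predicted):
    """Return instance ids in decreasing order of predicted defect count."""
    pairs = sorted(zip(instance_ids, predicted), key=lambda p: p[1], reverse=True)
    return [instance_id for instance_id, _ in pairs]

# Hypothetical class names and predictions for illustration only:
ids = ["Parser", "Lexer", "Util", "Main"]
pred = [3.2, 0.4, 1.7, 0.0]
ranking = rank_by_predicted_defects(ids, pred)  # Parser first, Main last
```

A quality assurance team would then inspect the top of `ranking` first, consuming test resources in priority order.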

Software Metrics
Some studies in the literature aim to identify the most relevant independent variables, i.e., the software properties most closely associated with the defective instances or with the instances of highest defect density. Basili et al. [22] presented seminal studies in this context by using object-oriented metrics. Ostrand et al. [3] proposed prediction models for large-scale industrial systems using code metrics and historical defect information. Zimmermann and Nagappan [4] proposed the use of social network analysis metrics (also called network metrics) extracted from the software dependency graph. Further metrics were investigated as well, such as developer-related metrics [23], organizational metrics [24], process metrics [25], change-related metrics [5], and antipattern-related metrics [26]. In this study, we use both code metrics and network metrics since they can be automatically extracted directly from the source code; thus, no historical or additional data is required. Furthermore, some works in the literature indicate that better prediction performances can be obtained by combining these two metric sets [5,4].

Cross-project Defect Prediction
The dependent variable information can be extracted from historical defect data. When available, those data can be mined and associated with the respective defective software parts [27,18]. However, in practice, not all software companies maintain clear records about defects or have sufficient data from previous projects. In this case, the training dataset can be composed of external projects with known defect information. In the literature, this approach is called Cross-project Defect Prediction (CPDP). In this approach, available projects from different application domains and with different characteristics can be agglomerated into a heterogeneous dataset. However, the heterogeneity of the data may compromise the efficiency obtained from the defect prediction models [10].
Some alternative approaches were proposed in the literature in order to improve the performance obtained from CPDP models. Representative approaches in this context are Metric Compensation [28], Nearest Neighbour Filter [12], Transfer Naive Bayes [29], TCA+ [30], and Clustering Filter [13]. Among these approaches we highlight the filtering methods proposed by Turhan et al. [12] and Peters et al. [13], since they are simple and effective. Both filtering methods aim to select the most similar (and relevant) instances from the CPDP training dataset, as shown below.

Instance Filtering methods
Given a cross-project training dataset and a test dataset to be predicted, the filtering methods aim at selecting only the most relevant training instances. The resulting filtered dataset may improve the efficiency obtained from CPDP models.
Turhan et al. [12] proposed a filtering method based on the nearest neighbour filter. In this method, here called the Burak filter, each test instance is compared with the entire training dataset. The comparison is based on the Euclidean distance and considers the entire set of independent variables. Then, for each test instance, the k most similar training instances are selected to compose the filtered dataset. This method is illustrated in Figure 1(b), where for each test instance the two most similar training instances are selected. As illustrated, the same training instance can be similar to more than one test instance. The instance labeled 'L' is not included in the filtered dataset.
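The Burak filter can be sketched as follows. This is an illustrative Python fragment under simplifying assumptions (plain lists as feature vectors, duplicates kept only once); the original work operated on full metric datasets and used k = 10.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def burak_filter(train, test, k=10):
    """For each test instance, select its k nearest training instances.
    A training instance selected by several test instances appears only
    once in the filtered dataset (indices collected in a set)."""
    selected = set()
    for t in test:
        nearest = sorted(range(len(train)), key=lambda i: euclidean(train[i], t))
        selected.update(nearest[:k])
    return [train[i] for i in sorted(selected)]
```

Training instances far from every test instance (the 'L' case in Figure 1(b)) are never among anyone's k nearest neighbours, so they drop out of the filtered dataset.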
The method proposed by Peters et al. [13] selects the most similar instances from a different perspective. First, the training instances are clustered considering all the test instances as centroids. The clustering process uses the K-means algorithm with the Euclidean distance [31]. As a result, each training instance is associated with its most similar centroid (test instance). This first step is illustrated in Figure 1(c). Note that a test instance may end up with no associated training instances (label 'L'). Second, for each test instance, the most similar training instance of the respective cluster is selected. This second step is illustrated in Figure 1(d). The instances labeled '1' and '2' represent the most similar instances and compose the resulting filtered dataset.
For both filtering methods, the resulting filtered dataset is then used as the training dataset.
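The two steps of the Peters filter can likewise be sketched. This is an illustrative Python fragment, not the authors' implementation: a single nearest-centroid assignment stands in for the clustering step, with the test instances as fixed centroids.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def peters_filter(train, test):
    """Step 1: assign each training instance to its nearest test instance
    (test instances act as fixed centroids). Step 2: from each non-empty
    cluster, keep the single training instance closest to its centroid."""
    clusters = {}
    for tr in train:
        nearest = min(range(len(test)), key=lambda j: euclidean(tr, test[j]))
        clusters.setdefault(nearest, []).append(tr)
    return [min(members, key=lambda tr: euclidean(tr, test[j]))
            for j, members in sorted(clusters.items())]
```

Note the contrast with the Burak filter: here the filtered dataset contains at most one training instance per test instance, and test instances with empty clusters contribute nothing.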

Feature Subset Selection
Both instance filtering methods presented above use the Euclidean distance to measure the similarity between instances. The set of independent variables of an instance m can be represented by a vector $X_m = \{x_{m1}, x_{m2}, \ldots, x_{mn}\}$, where $|X_m| = n$ is the number of variables (or features). The Euclidean distance between two instances i and j can be defined as

$$d(X_i, X_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2} \quad (1)$$

Let S be a subset of features composed of the most relevant features in X. We propose that using only the most relevant features in the Euclidean distance (i.e., $d(S_i, S_j)$ in Equation 1) can lead to a more accurate filtered dataset and thus to a more efficient defect model. In order to select the most relevant features $S \subset X$, we evaluate different Feature Selection (FS) methods.
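The restriction of the distance to a feature subset is a one-line change, sketched here in Python for illustration. Restricting the sum to the indices of the relevant features gives the subset distance used by the proposed IFFS methods; the vectors and chosen indices below are hypothetical.

```python
import math

def euclidean_subset(xi, xj, selected):
    """Euclidean distance computed only over the feature indices in
    `selected`, i.e. d(S_i, S_j) instead of d(X_i, X_j)."""
    return math.sqrt(sum((xi[k] - xj[k]) ** 2 for k in selected))

# Distance over all features vs. over one (hypothetically) relevant feature:
xi, xj = [1.0, 5.0, 0.0], [4.0, 5.0, 4.0]
d_all = euclidean_subset(xi, xj, [0, 1, 2])  # full distance
d_rel = euclidean_subset(xi, xj, [0])        # subset distance
```

Irrelevant or noisy features no longer contribute to the distance, so two instances that agree on the relevant features are judged similar even if they differ elsewhere.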
FS methods have been widely investigated in the data mining literature [14]. Traditionally, they are applied to high-dimensional data to reduce the data dimensionality by removing irrelevant and redundant features [14]. We focus on a specific category of FS methods, called the filter model, since they do not depend on the data mining algorithm. We evaluate three supervised methods: Information Gain [15], Relief [16], and CFS [14]; and one unsupervised method: Sparcl [17]. The supervised methods select the most relevant features based on the target attribute. The unsupervised methods, in contrast, select the most relevant features based on their characteristics only, independently of the target attribute.
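As one concrete example of a supervised filter-model method, the Information Gain of a feature is the reduction in class entropy obtained by observing that feature. A minimal Python sketch, assuming the feature has already been discretized (e.g. binned metric values); the study itself used the R 'FSelector' implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(labels) - sum_v p(v) * H(labels | feature = v), for a
    discretized feature given as a list parallel to `labels`."""
    n = len(labels)
    conditional = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional
```

A feature that perfectly separates defective from non-defective instances attains the maximum gain; a feature independent of the class attains zero and would be ranked last.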
These methods represent part of the most important FS methods known in the literature. Specialized feature selection texts can be found in [14,17].

Experimental Setup
We conducted an in silico experiment to evaluate the performance of IFFS methods in both contexts: Classification and Ranking. For each experimentation context, we evaluate two filtering methods, the Burak filter [12] and the Peters filter [13], combined with feature subset selection. We analysed both Feature Selection (FS) methods and specific Software metric Subsets (SS). All experiments and data analysis were performed using the R Project platform [32].

Software Projects
The experiments in this study were conducted on 36 versions of 11 Java open source projects, available in the PROMISE repository 1 . The collection and preparation of the defect data were made and provided by Jureczko and Madeyski [18]. The authors provide a link 2 with detailed information about the software projects and the construction of each dataset. Each instance in a dataset represents a Java object-oriented class (OO class). Originally, the dependent variable corresponds to the number of defects found in each OO class.
For the ranking analysis we considered the datasets in their original form. Appendix Table A2 presents the distribution and density of defects for each dataset. For the classification analysis we converted the dependent variable into a binary class (1 if the number of defects > 0; 0 otherwise). Appendix Table A1 lists the number of instances, number of defects, and defect rate for each of the analysed datasets.
We evaluate two distinct metrics sets as the independent variables: code metrics (CODE) and network metrics (NET). These metrics are numerical and can be automatically extracted directly from the source code files.

Code Metrics
The code metrics set is composed of 19 metrics and includes complexity metrics, C&K (Chidamber and Kemerer) metrics, and structural code metrics. This metric set has been reported as a good quality indicator in the literature, as discussed in [18]. These metrics can be extracted using the Ckjm 3 tool. Further details can be found in [18].

Network Metrics
The network metrics set was extracted in two steps. First, the dependency graph was constructed for each analysed software version. Each OO class file is represented by a vertex and any dependency relationship between two vertices is represented by an edge in the graph. This process was carried out using the PF-CDA tool 4 . The inner-class vertices of a same parent class were treated as a unique vertex. For example, vertices named "A", "A$B", and "A$C" were all contracted into the parent vertex "A". Once the dependency graph was created, we extracted the network metrics data. Table 1 briefly describes the metrics analysed in this study. The metrics set can be grouped into two categories: ego metrics and global metrics. The ego metrics are extracted from the ego network. Given a vertex v, its ego network contains all vertices directly connected to v and their respective connections. The global metrics are extracted from the original entire graph. In addition, some metrics can be extracted from three different kinds of networks according to the direction of the edges: incoming edges (In), outgoing edges (Out), or undirected edges (All). We analysed 24 different network metrics. From these 24 metrics, considering all their variations (In, Out, All), we extracted a total of 54 network metrics. We computed these metrics using the igraph package [33]. The analysed metrics were selected according to their availability in the igraph package or the feasibility of implementing them.
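The inner-class contraction and the simplest directed metrics (In/Out degree) can be sketched with plain data structures. An illustrative Python fragment (the study used the R igraph package); class names and edges below are hypothetical.

```python
from collections import defaultdict

def contract_inner_classes(edges):
    """Merge inner-class vertices ("A$B", "A$C") into their parent ("A"),
    as done before metric extraction; self-loops created by the merge
    are dropped. `edges` is an iterable of (source, target) class names."""
    parent = lambda v: v.split("$", 1)[0]
    return {(parent(a), parent(b)) for a, b in edges if parent(a) != parent(b)}

def degrees(edges):
    """In-degree and Out-degree per vertex of the dependency graph,
    two of the directed (In/Out) network metric variants."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for a, b in edges:
        outdeg[a] += 1
        indeg[b] += 1
    return dict(indeg), dict(outdeg)

contracted = contract_inner_classes({("A$B", "C"), ("A", "C"), ("C", "A$D")})
indeg, outdeg = degrees(contracted)
```

The undirected (All) variants follow by treating each edge as bidirectional, and the ego metrics by restricting the edge set to a vertex's direct neighbours before computing the same quantities.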

Feature Subset Selection
We analysed the FS methods CFS (FS CFS) [14], Information Gain (FS IG) [15], Relief (FS RLF) [16], and Sparcl (FS SPC) [17]. The first three are supervised methods; thus, the feature selection is performed on the training data, in which the target attribute is known. The Sparcl method is unsupervised; thus, it can be performed on both training (FS SPC Tr) and test data (FS SPC Te), since no target attribute information is required. The methods FS IG, FS RLF, and FS SPC provide a ranking of weights that measures the relevance of each feature according to some criterion. Deciding on the best k features to compose the feature subset is not a trivial task [14]. Thus, for each method we analysed four different values of k = {5, 10, 15, 20} in order to approximate the best k configuration. Considering all methods and their variations of k, we analysed 17 FS methods for each filtering method. We used the R packages 'FSelector' [34] and 'sparcl' [17] with the implemented FS methods. All the analysed FS methods and their respective implementations support both types of datasets, with continuous (for ranking) and discrete (for classification) values in the dependent variable.
In addition, we investigated two distinct metric subsets: code metrics (SS CODE) and network metrics (SS NET). For each metric subset, we applied a Pearson correlation test in order to discard the metrics considered redundant, i.e., those with correlation greater than 0.90 [35]. We also analysed the performance with the original set of features (Orig) as a comparative reference.
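The redundancy removal step can be sketched as a greedy pass over the features. This is an illustrative Python fragment, assuming one plausible policy (keep a feature only if it is not highly correlated with any already-kept feature); the paper does not detail the exact tie-breaking used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_redundant(features, threshold=0.90):
    """Greedily drop any feature whose |r| with an already-kept feature
    exceeds the threshold; `features` maps name -> list of values."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept
```

With the 0.90 threshold, a metric that is a near-linear function of another (e.g. two size measures) survives only once, which matters for distance-based filtering because duplicated features would otherwise be double-counted in Equation 1.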

Experiment Design
The experiment design is presented in Figure 2. The structure is quite similar for both experimentation contexts, Classification and Ranking. The differences are concentrated in three points: the dataset (Section 3.1), the prediction model (Section 3.4), and the performance measures (Section 3.5), each discussed individually in the text.
First, we joined all datasets into one unique Cross-Project Dataset (CPD). Then, we conducted a cross-project analysis. Consider a project P with versions V(P) and a specific version v_i ∈ V(P). For each v_i ∈ V(P), we used v_i as the test set and CPD − V(P) as the training set. In this way, we can analyse the prediction performance disregarding any bias from different versions of the same project. The training data was filtered by applying the Burak and Peters filters combined with the studied feature subsets presented above. The filtered training data is then used to construct the defect prediction models.
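The CPD − V(P) split can be sketched directly: the training set excludes every version of the target project, not just the version under test. An illustrative Python fragment with a hypothetical dataset layout (keyed by project and version).

```python
def cross_project_split(datasets, target_project):
    """Training set = CPD minus *all* versions of the target project,
    avoiding bias from sibling versions of the same project.
    `datasets` maps (project, version) -> list of instances."""
    train = [row for (proj, _ver), rows in datasets.items()
             if proj != target_project for row in rows]
    tests = {ver: rows for (proj, ver), rows in datasets.items()
             if proj == target_project}
    return train, tests

# Hypothetical miniature CPD for illustration:
cpd = {("ant", "1.3"): ["a1"], ("ant", "1.4"): ["a2"], ("log4j", "1.0"): ["l1"]}
train, tests = cross_project_split(cpd, "ant")
# train excludes both ant versions; each ant version is predicted separately
```

The filtering step (Burak or Peters, with or without feature subset selection) would then be applied to `train` once per test version.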

Prediction Models
The models were constructed using the Random Forest algorithm [36]. This algorithm has shown good performance compared with other algorithms in both contexts, Classification [20,37] and Ranking [6,2]. Lessmann et al. [20] argue that Random Forest is the current state-of-the-art defect predictor. Random Forest is an ensemble of models based on decision trees. At training time, a multitude of decision trees is constructed, from which the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees is output [36]. The Random Forest algorithm is robust to redundant and irrelevant attributes, although it can produce overfitted models.
Unlike the classification models, the ranking models are constructed in two steps: first, a regression model is constructed, and then the instances are ranked according to their predicted number of defects. For the experiments we used the R package 'randomForest' [38]. It is important to note that some internal processes of Random Forest, such as sampling and bagging, include randomness, which can lead to different models in each execution. Thus, we constructed 30 models for each evaluated method. The results are presented considering the mean performance.
In [4] and [5] the authors argue that network metrics can lead to better prediction performance when combined with code metrics. Thus, we considered both metric sets, CODE and NET, as predictors in the experiments. Again, the features with high Pearson correlation (> 0.90) were discarded. The remaining set of non-redundant features is composed of 20 code metrics and 35 network metrics.

Classification
We analysed three performance measures for classification: recall, probability of false alarm (pf), and g-measure [39]. Recall and pf focus on the defective class and are efficient for evaluating datasets with a small number of defective examples [40]. These measures are also used to evaluate the performance of the two analysed filters in their original works [12,13]. Table 2 presents the confusion matrix and the analysed measures.
Recall measures how many of the actual defects were found. The pf is the probability of a non-defective instance being predicted as defective. The optimal pf is 0% and the worst result is 100%. The g-measure is the harmonic mean between recall and 1 − pf (the specificity).
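From the confusion matrix of Table 2, the three measures can be computed as follows; an illustrative Python sketch with hypothetical counts.

```python
def classification_measures(tp, fn, fp, tn):
    """From confusion-matrix counts:
    recall = TP / (TP + FN)           -- fraction of actual defects found
    pf     = FP / (FP + TN)           -- false-alarm probability
    g      = harmonic mean of recall and specificity (1 - pf)."""
    recall = tp / (tp + fn)
    pf = fp / (fp + tn)
    g = 2 * recall * (1 - pf) / (recall + (1 - pf))
    return recall, pf, g

# Hypothetical counts: 8 defects caught, 2 missed, 5 false alarms, 45 correct rejections
recall, pf, g = classification_measures(tp=8, fn=2, fp=5, tn=45)
```

Note the trade-off the g-measure captures: predicting everything defective drives recall to 100% but also inflates pf, collapsing the harmonic mean.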

Ranking
For the ranking task we analysed two performance measures: the percentage of defects (α) in the first (β) instances of the ranking [8,6]; and the fault-percentile-average (FPA) [6,2].
The first measure is widely used in the literature since it is practical and simple. Weyuker et al. [6] studied datasets in which the 20% (β = 20%) most defective instances contain 80% (α = 80%) or more of the reported defects. This information can be used to analyse how many defects are present in the first 20% of instances of the predicted ranking, relative to the 80% already known. However, this cutoff β = 20% can vary for each project. In our study, for each project, we fixed α ≈ 80%, with β ranging between 5% and 74%, as shown in Appendix Table A2.
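The per-project β cutoff for a fixed α can be derived from the actual defect counts alone, as this illustrative Python sketch shows (hypothetical counts; the paper's cutoffs are in Appendix Table A2).

```python
def beta_cutoff(defects, alpha=0.80):
    """Smallest fraction of instances (sorted by actual defect count,
    decreasing) whose cumulative defects reach at least `alpha` of the
    total: the per-project cutoff beta for the chosen alpha."""
    ordered = sorted(defects, reverse=True)
    total, cum = sum(ordered), 0
    for i, d in enumerate(ordered, start=1):
        cum += d
        if cum >= alpha * total:
            return i / len(ordered)
    return 1.0
```

A project whose defects are concentrated in a few classes yields a small β (the Forrest and Pbeans case discussed later), while a project with defects spread evenly yields a β near α.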
Although the first measure can be efficient for some purposes, it disregards the ordering of the remaining instances in the ranking. Furthermore, it is sensitive to the arbitrary cutoff. Weyuker et al. [6] proposed a general measure, called FPA, that takes the whole ranking into account. Consider K instances $i_1, i_2, \ldots, i_K$, listed in increasing order of predicted number of defects, $i_K$ being the predicted most defective instance. Let $n_k$ be the actual number of defects of instance k, and $N = n_1 + \ldots + n_K$ the total number of defects in the entire list. For any m in 1, ..., K, the proportion of actual defects in the top m predicted instances is

$$P_m = \frac{1}{N} \sum_{k=K-m+1}^{K} n_k$$

The FPA measure [6] is then the average of the $P_m$. It is defined as:

$$FPA = \frac{1}{K} \sum_{m=1}^{K} P_m$$

In other words, FPA is the average percentage of actual defects contained in the top m instances (m = 1, 2, ..., K). This means that, if the prediction is good, the most defective instances will occur at or near the top and will thus be counted in most of the terms that contribute to the average.
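The definition of FPA translates directly into code; a minimal Python sketch with a hypothetical three-instance ranking.

```python
def fpa(actual_defects):
    """Fault-percentile-average. `actual_defects` lists the actual defect
    count of each instance ordered by *increasing* predicted defects
    (last element = predicted most defective). FPA averages, over m,
    the proportion of actual defects in the top m predicted instances."""
    K, N = len(actual_defects), sum(actual_defects)
    # P_m = (1/N) * sum of the actual counts of the top m predicted instances
    p = [sum(actual_defects[K - m:]) / N for m in range(1, K + 1)]
    return sum(p) / K

# A perfect ranking scores higher than its reversal:
perfect = fpa([1, 2, 3])   # actual counts agree with predicted order
reversed_ = fpa([3, 2, 1])  # worst possible ordering of the same counts
```

Because every prefix of the ranking contributes a term, misplacing a highly defective instance anywhere in the list lowers FPA, whereas the α/β measure only notices misplacements across the single cutoff.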

Results
In this section we discuss the results for the respective research questions in both contexts, Classification and Ranking. We compare the performance of the evaluated IFFS methods with the performance of noFilter. To verify the statistical difference in performance, we use the non-parametric Wilcoxon signed-rank test (p-value < 0.05). The symbols ⊖ and ⊕ represent, respectively, lower and higher performance according to the mean value, with statistical significance. The absence of these symbols means no statistical difference between the compared performances.

RQ1: Can IFFS methods lead to better performances on CPDP models?
Classification

Table 3 shows the prediction performances obtained for each project. The performances in the noFilter column are reported by the g-measure (g). All other columns report the relative value ∆g = g − g_noFilter, i.e., negative values indicate lower performance and positive values indicate higher performance relative to noFilter. For better visualization, we multiply the values by 100. The bold values indicate the highest ∆g obtained in the respective project row. The projects are presented in decreasing order of the highest ∆g value. The last row presents the mean value of each column. The Best FS columns present the best performance among all the evaluated FS methods. The NET and CODE columns present the performance obtained by considering the feature subsets SS NET and SS CODE.

We can analyse this table from different perspectives. First, we notice that the Best FS columns present better performance than Orig in all cases for both filtering methods. This means that at least one of the evaluated FS methods performs better than the original method. The FS methods presented the highest performances in most projects, as highlighted by the bold values. These improvements in performance can be crucial for deciding whether a prediction model is appropriate for practical use. Zimmermann et al. [10] present a large experiment on cross-prediction feasibility. Their results show that only 3.4% of the studied models presented a prediction performance considered appropriate for practical use. They considered a defect model successful when accuracy, precision, and recall are all greater than 75%. Peters et al. [13] considered both 60% and 75% thresholds for the g-measure. In Figure 3 we compare the percentage of projects with performance greater than 60% and 75% for the methods noFilter, original Burak, original Peters, Burak FS, and Peters FS (with reference to the Best FS column).
The original methods perform slightly better than noFilter. The highest performances are presented by the filtering methods associated with FS. Considering g ≥ 60, Peters FS and Burak FS presented successful performance for more than 50% of projects. For the threshold g ≥ 75, the method Burak FS is of practical use for almost 20% of projects.

Ranking

Table 4 shows the performance obtained for each IFFS method considering the percentage of actual defects present in the top instances of the ranking. The respective cutoffs α and β for each project can be viewed in Appendix Table A2. Table 4 shares the same visual structure as Table 3. The results are relative to noFilter, where ∆α = α − α_noFilter. The Random column can be used as a comparative reference. In theory, as discussed in [2], supposing a context where the test resources are limited and allow testing only 20% of the software, the testers will spend them on the first 20% of instances. With a random resource allocation, it is expected to cover only 20% of the total amount of defects in the first 20% of instances. However, by prioritizing the most defective software parts it is possible to increase the percentage of covered defects. By comparing the random allocation shown in Table 4 with the β cutoff shown in Appendix Table A2, we can see coincident values (including the mean values of 25%), which corroborates the expected theory. In addition, we can observe strong learning of the noFilter models relative to the Random models: the performances are higher for all projects and the mean value is twice as high.
The Best FS column presented higher performance than noFilter, with statistical significance, for both the Burak and Peters filters. The original methods, however, presented lower performance than noFilter, also with statistical significance. The Peters filter presented a slightly better general performance than the Burak filter if we consider the mean value of Best FS.
It is important to note that here we ignored all versions of the projects Forrest and Pbeans, since their β cutoffs (see Appendix Table A2) proved to be too small for this kind of analysis. This sensitivity to the cutoff is one of the drawbacks of this performance measure. Furthermore, the ordering beyond the β cutoff is ignored by this measure. In order to provide a more accurate analysis we also considered the FPA performance measure. Table 5 shows the relative performances obtained for each evaluated method. The Max and Random columns are comparative references. Max is the reference for the actual number of defects and the perfect ranking with the highest possible FPA. Again, the Best FS column presents better performances for both filters in relation to noFilter, with statistical significance. The results presented in Table 5 show similar patterns of performance that support a positive answer to RQ1. For both contexts (Classification and Ranking), the best performances obtained for each project are not dominated by one or just a few IFFS methods. Instead, each project has exclusive characteristics that favour one or another IFFS method. Thus, we analysed the best performances by counting the frequency with which a method is among the top 5 best performances for a project. We consider all 41 analysed methods: ((17 FS + 2 SS + Orig) × 2 filters) + noFilter. This approach lets us show the results more clearly: the methods with higher frequency lead to better performances for a larger number of software projects. Tables 6 and 7 show the 30 methods with the highest frequencies among the best performances for Classification and Ranking, respectively. The Ranking frequencies are based on the FPA measure, since it covers the performance of all projects and has also shown to be more consistent in this context.
In the Classification context, the method with the highest frequency (BURAK FS IG 10, where the postfix 10 represents the k configuration used in FS IG) covers almost 30% of all projects, followed by PETERS SS CODE with frequency 10. These two methods also appear among the 10 most frequent methods for Ranking. Indeed, we can observe some similarities by comparing the two contexts. For example, 6 of the 10 most frequent methods for Classification are also among the 10 most frequent methods for Ranking. The original filter methods (with no feature subset selection) presented frequencies lower than noFilter in both prediction contexts. Actually, the performance of noFilter appears among the 15 and 5 highest frequencies for Classification and Ranking, respectively. This result partially contradicts the results presented by Turhan et al. [12] and Peters et al. [13], in which instance filtering improves the prediction in relation to noFilter. However, some differences can be highlighted between their works and ours. First, we assumed that both original methods consider all features when filtering instances. In their works, however, the experiments were conducted considering only the CODE metric set, since only this set was available (i.e., the NET metric set was not available for analysis). Here, we can observe that SS CODE presented good performances, since it appears among the 10 most frequent methods for both contexts and filters.
In addition, the experiments of Turhan et al. [12] were conducted in the cross-company context, where the analysed projects were developed by different companies. The projects considered in this study were mostly developed by the Apache Foundation (except for the Pbeans project). This aspect can reduce the level of heterogeneity of the data and thus improve the performance of prediction with noFilter. Lastly, Peters et al. [13] restrict the testing domain to software projects with fewer than 100 instances, unlike ours (see Appendix Tables A1 and A2). Observing the filters separately, we note that the Burak filter is present in 5 of the 7 most frequent methods in both contexts. These frequencies, however, are not mutually exclusive and may count the same project for several methods. For example, in the Classification context the 10 most frequent methods are needed to cover all 36 projects.
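For comparison, the Burak filter discussed above admits a much simpler sketch: for every test instance, its k nearest training instances (Euclidean distance, k = 10 in the original proposal by Turhan et al. [12]) are kept, and their union forms the filtered training set. The code below is a minimal illustration; in this study the features entering the distance vary with the FS/SS configuration.

```python
import numpy as np

def burak_filter(train_X, test_X, k=10):
    """NN-filter of Turhan et al.: for every test instance, keep its k
    nearest training instances; the union (without duplicates) becomes
    the filtered cross-project training set.  Returns row indices."""
    tr = np.asarray(train_X, dtype=float)
    selected = set()
    for t in np.asarray(test_X, dtype=float):
        dist = np.linalg.norm(tr - t, axis=1)      # Euclidean distances
        selected.update(np.argsort(dist)[:k].tolist())
    return sorted(selected)

# Test instances near the first cluster select only that cluster:
idx = burak_filter([[0, 0], [1, 1], [10, 10], [11, 11]], [[0.5, 0.5]], k=2)
assert idx == [0, 1]
```

Unlike the Peters filter, the retained set grows with k and with the number of test instances, so it tends to discard far fewer training rows.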

Threats to Validity
The results indicate that the performance of CPDP models can be improved by applying the evaluated methods. However, some sources of bias can be highlighted in the conducted experiments. First, the evaluated dataset is composed only of Java open source projects. Moreover, the majority of the evaluated projects belong to a single organization. These biases can reduce the heterogeneity of the data and influence the experimental results. Datasets from other sources could be used to improve the generality of the results.
Another important threat to validity is that the number of defects found for each software part is an approximate estimate and does not represent the real number of existing bugs [18]. However, information about the exact number of defects in a software system is difficult, if not impossible, to acquire in a real project [2]. Kitchenham et al. [9] argue that software companies frequently do not keep proper historical information about defect data. When available, this information is commonly private or restricted to internal use [4]. The datasets used in this work fulfill three important characteristics: 1) they are open for reuse; 2) the defect information was extracted from open source projects, which allowed us to extract new features from the source code (e.g., the dependency graph and the network metrics); and 3) the entire set is composed of different projects and versions, which enabled us to conduct experiments in the cross-project context. Jureczko and Madeyski [18] followed a systematic process to acquire the defect information from historical bug reports, a procedure commonly used in the literature [27,11]. The datasets provided by Jureczko and Madeyski [18] are also used in other works in the defect prediction literature, such as [41], [42], [43] and [44].
Furthermore, the models in this study were constructed using only one machine learning algorithm, a choice supported by the performance results presented in [20], [37], [6], and [2]. However, data mining is an active research field and other algorithms may lead to different results.

Conclusion
In this study we investigated the performance of CPDP models obtained by applying Instance Filtering methods associated with Feature Subset Selection, in both prediction contexts (Classification and Ranking). We evaluated 19 methods derived from different configurations of four distinct Feature Selection methods and two metric subsets. The evidence obtained from the results is similar in both prediction contexts: for all the analysed projects, at least one of the 19 evaluated methods presented better performance than both the original filtering methods and the absence of filtering.
In the Classification context, the percentage of models considered successful for practical use was improved. In the Ranking context, besides presenting better performances in relation to the comparative reference, it was possible to observe clear learning of defect patterns when compared with a random ordering.
We also investigated which of the evaluated methods present better performance. We presented a list of the evaluated methods ordered by the frequency with which each method appears among the top five performances for a project. The most frequent methods appear for only 11 (Classification) and 10 (Ranking) of the 36 analysed projects, indicating that the most appropriate method can vary for each project.
For future work, we are investigating which characteristics best represent a software project and how this information can be used to predict the most appropriate IFFS method to apply.

Table A2: Summary of the project characteristics for Ranking. We analysed a ranking performance measure that reports the percentage of defects (α) contained in the first (β) instances of the ranking. In the literature, usually β = 20%; in this study we fixed α ≈ 80% and let β vary depending on the defect distribution of each project.
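The β cutoff described in the Table A2 caption can be sketched as follows: given the actual defect counts of the modules in ranked order, β is the smallest fraction of the ranking whose defects reach at least α (≈ 80%) of the project's total. The implementation is illustrative.

```python
import numpy as np

def beta_cutoff(defect_counts_ranked, alpha=0.8):
    """Smallest fraction beta of the ranked modules whose actual
    defects account for at least alpha of all defects in the project.

    Projects where a few modules hold most defects yield a very small
    beta, which makes the alpha/beta measure unreliable for them.
    """
    n = np.asarray(defect_counts_ranked, dtype=float)
    captured = np.cumsum(n) / n.sum()          # defect fraction per prefix
    m = int(np.argmax(captured >= alpha)) + 1  # first prefix reaching alpha
    return m / len(n)
```

For example, a uniform distribution over five modules gives β = 0.8, while a single module holding 80% of the defects gives β equal to one module's share of the ranking, illustrating why projects with tiny β cutoffs were excluded from this analysis.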