
1 Introduction

Systematic reuse involves using previously developed software components/artifacts to build new software systems, thereby reducing the overall cost, development time, and effort of software development [1]. For instance, to assess the extent of actual reuse in existing software projects, an investigation was performed on numerous sizeable open-source and individual software projects bundled with popular BSD and Linux distributions [2]. The observations showed that almost half of the 5.3 million files scrutinized in the projects had been employed in at least two projects. However, no software metrics were employed by the author to conduct this analysis.

Software metrics that quantify the reusability of software projects are vital for accomplishing "development by reuse" and "development for reuse" [3]. Such reuse metrics can also contribute to reusability prediction models that let software developers estimate, in advance and without examining the complete codebase, the total code that can be reused, i.e., integrated without change, in a new version of an existing software system, and hence the aggregate expenditure of developing or updating it. Since developing new code requires substantial time and effort, reusability assessment of source code components can minimize this development effort and time and provide a means of estimating the development cost of the new software.

Machine learning (ML) algorithms are being effectively utilized to build potent prediction models in varied areas such as engineering, medicine, and geology [4]. Classification, in ML, is the task of assigning a new observation to one of a set of categories or classes (subpopulations) on the basis of a training set of instances (observations) whose class membership is known [4]. An algorithm implementing classification is known as a classifier. Meta-classifiers build a collection of classifiers and then classify new data records by combining the outputs of these classifiers through some mechanism. Experimental results have indicated that meta-classifiers are often more accurate and more robust, even in the presence of noisy data, and achieve a lower average error rate than their constituent individual classifiers [5].

Therefore, in this study, we conduct an empirical validation using reusability datasets constructed (with respect to seven randomly selected reuse metrics) from four consecutively released versions of a Java software system to determine the competency of seven meta-classification algorithms for developing version-to-version source code reusability prediction models. In addition, we examine and compare the results of these meta-classifiers with the logistic regression (LR) [6] technique using performance indicators such as accuracy and the area under the ROC curve (AUC). Lastly, we statistically rank all the techniques used in this study with the Friedman test to determine which algorithm performs best.

Though studies [7, 8] in the existing literature assess the ability of classification techniques for change-proneness and fault prediction, no study to date has statistically compared the performance of meta-classifiers with the LR technique for predicting version-to-version source code reusability.

The rest of this paper is organized as follows. The next section delivers a concise summary of the existing literature on the topic. Section 3 describes the empirical data collected and the independent and dependent variables selected as part of the research background. Section 4 reports the meta-classification techniques employed and the performance measures selected to evaluate them. Section 5 presents the empirical results of applying the seven meta-classifiers and the LR technique, along with the Friedman test results. Section 6 discusses the threats to the validity of our work, and the last section states the conclusions and future work.

2 Related Work

This section provides a brief summary of the literature on software reusability prediction; for detailed reading, one can refer to [9], the most recent and the sole systematic literature review of reusability metrics and prediction of software components conducted according to established systematic-review guidelines. Recently, self-organizing maps (SOM) were employed to cluster datasets of CK metric values gathered from three Java-based projects [10]. It was also recently established that the reusability of a source code class is inversely related to its depth of inheritance and number of children [11]. Moreover, authors [12] have considered different reuse variants (common reuse, high-reuse variation, low-reuse variation, and single use) to empirically and correctly estimate fault proneness across products and across releases of software product lines. Though many metrics have been proposed for measuring the reusability of a software component or of a system as a whole [3, 9, 13,14,15], in the majority of cases these have been qualitative reuse metrics whose evaluation depends on individual judgment. Also, only a few machine learning techniques (K-means and hierarchical clustering, support vector machines, artificial neural networks, and decision trees) have been explored for reusability prediction, and these do not include any ensemble/meta-classifiers or the LR technique. Moreover, these articles do not compare actual metric values against concrete reuse results from a realistic software development environment to validate the reusability predictions. Finally, an empirical evaluation via reuse metrics of software reuse occurring within the same product family and from version to version has not been reported in the literature examined.

To address this gap, we assess a wide range of meta-classification techniques for reusability prediction using four datasets created from four consecutively released versions of a realistic Java-based software system, combining the actual reuse results with some randomly selected reuse metrics. We also use the Friedman test to assign statistical ranks to the techniques and determine whether the selected meta-classifiers significantly outperform the LR technique, thereby providing empirical evidence for selecting the best version-to-version reusability prediction model.

3 Research Background

The following subsections provide an overview of the empirical data collected and the metrics (the independent variables and the dependent variable) used to build the reusability prediction models.

3.1 Empirical Data Collection

To construct the version-to-version file reusability prediction models, we created four datasets using four consecutive versions of the JFreeChart software, a free (LGPL) chart library for the Java platform. The details of the four selected versions are given in Table 1. Numerical values of seven different, commonly used static code metrics were collected for each of the Java files in each of the four selected versions of JFreeChart using two static code analysis tools, Stan4J and JHawk 6.1.3. These Java files are composed of one or more Java classes (or interfaces); where a file contains more than one class, an aggregate value of the metrics over all its classes is collected. A brief description of the seven selected reuse metrics (Coupling between objects, Efferent Coupling, Depth of inheritance, Lack of cohesion between methods, Number of Calls, Number of methods defined in a file, and Cyclomatic Complexity) is given below:

Table 1 Details of the four JFreeChart versions employed in this study

Coupling between objects [16] is defined as the total number of files coupled to a given file; two source code files are coupled if the methods declared in one use the instance variables or methods of the other. Efferent Coupling [17], in contrast, counts only the number of external files used by a given file. Depth of inheritance [16] is the maximum length of the path from a given source code file to the root file in the inheritance structure of the software. Lack of cohesion between methods [16] counts the distinct methods in a given code file that reference a given instance variable. Number of Calls is the number of method calls (in statements as well as in logical expressions) in the target file. Number of methods defined in a file [18] is the total number of methods contained in a given Java code file. Lastly, Cyclomatic Complexity [19] measures the number of linearly independent paths through the program source code.
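As a rough illustration of how the last of these metrics can be computed, the sketch below approximates Cyclomatic Complexity by counting decision points in a hypothetical Java snippet; real analyzers such as Stan4J and JHawk parse the code properly rather than matching keywords, so this is only a simplification.

```python
import re

def cyclomatic_complexity(java_source: str) -> int:
    """Approximate McCabe's cyclomatic complexity as the number of
    decision points plus one. A real analyzer would walk the AST;
    this keyword count is only a rough sketch."""
    decision_keywords = r"\b(if|for|while|case|catch)\b"
    # Each decision point adds one independent path through the code.
    decisions = len(re.findall(decision_keywords, java_source))
    # Boolean short-circuit operators also branch control flow.
    decisions += java_source.count("&&") + java_source.count("||")
    return decisions + 1

snippet = """
public int sign(int x) {
    if (x > 0) { return 1; }
    else if (x < 0) { return -1; }
    return 0;
}
"""
print(cyclomatic_complexity(snippet))  # two if-branches -> complexity 3
```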

We used a clone detection tool called AntiCut&Paste [20] to estimate the reusability of the individual source code files of each version. The source code files of two successive versions of JFreeChart were supplied to the tool as input, and it returned the Java files found to be common to the two releases. After this step, a binary reuse label of "Yes" or "No" was assigned to every source code file, with "Yes" indicating that the file had been reused in the next version in its entirety and without modification, and "No" indicating that the file had not been used in the next version or had been used with modifications.
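The labeling step can be sketched as follows; the two tiny version maps are hypothetical stand-ins for the file sets of two releases, and exact equality of file contents stands in for the clone detection tool's report.

```python
def reuse_labels(old_version: dict, new_version: dict) -> dict:
    """Assign the binary reuse label to each file of the old version.
    old_version / new_version map file names to file contents.
    'Yes' means the file appears unchanged in the next version;
    'No' means it is absent or was modified."""
    labels = {}
    for name, contents in old_version.items():
        unchanged = new_version.get(name) == contents
        labels[name] = "Yes" if unchanged else "No"
    return labels

# Hypothetical file contents for two successive versions.
v1 = {"Axis.java": "class Axis {}", "Plot.java": "class Plot {}"}
v2 = {"Axis.java": "class Axis {}", "Plot.java": "class Plot { int n; }"}
print(reuse_labels(v1, v2))  # {'Axis.java': 'Yes', 'Plot.java': 'No'}
```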

3.2 Dependent and Independent Variables

In our study, the binary reusability variable is the dependent variable, which is to be estimated via the independent variables. The independent variables are those from which the reuse status of a Java file in the next release of the software is predicted; in our context, they are the seven software metrics discussed in Sect. 3.1.

4 Research Methodology

Having described the construction of the four datasets in Sect. 3, this section discusses the meta-classification techniques incorporated in this study for predicting source code file reusability, along with the measures used to evaluate their performance.

4.1 Meta-Classification Techniques Employed

This section provides a brief description of the seven meta-classifiers [21] (AdaBoost, Bagging, Filtered, Multi-class (M-Class), Random Sub Space (RSS), Stacking and Voting) incorporated in this study.

AdaBoost (Adaptive Boosting) trains a sequence of simple weighted classifiers, reweighting the training data so that each subsequent classifier concentrates on the instances its predecessors misclassified; the weighted combination of these classifiers is likely to achieve a lower misclassification rate than any individual classifier. The Bagging (Bootstrap Aggregation) classifier supplies random subsets of the original dataset to each base classifier and then combines their separate predictions (by averaging or voting) to arrive at the final prediction. The Filtered classifier runs an arbitrary classifier on data that has first been passed through an arbitrary filter, which applies some mathematical evaluation based on an intrinsic characteristic of the training set, such as correlation. The Multi-class (M-Class) classifier converts a given multi-class problem into several binary-class problems: a metric is evaluated for each class by treating it as a binary classification problem against the union of all remaining classes, and the per-class results are then averaged, either weighted by class frequency or as a macro average that treats each class equally. Unlike in binary classification, no threshold score needs to be selected to generate predictions; the class obtaining the highest predicted score is the answer. The Random Sub Space (RSS) classifier is similar to bagging, except that it draws random subsets of the features, whereas bagging draws random samples of the instances with replacement. Stacking resembles boosting, but instead of an empirical weighting formula, the base learners' predictions are given as input to a meta-level classifier whose output is the final class. In the Voting methodology, the base-level classifiers' predictions are combined according to a static voting scheme, usually plurality voting.
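To make the sampling and combination schemes concrete, here is a minimal sketch (not the WEKA implementations) of three of the mechanisms named above: bootstrap sampling of instances (Bagging), random feature subsets (RSS), and plurality voting. All data and feature names in the usage lines are illustrative.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    # Bagging: draw as many instances as the original dataset,
    # sampling with replacement, for each base classifier.
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(feature_names, k, rng):
    # Random Sub Space: draw a random subset of the features instead
    # of a subset of the instances.
    return rng.sample(feature_names, k)

def plurality_vote(predictions):
    # Voting: the class predicted by most base learners wins.
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
rows = [("file%d.java" % i, "Yes" if i % 2 else "No") for i in range(6)]
print(len(bootstrap_sample(rows, rng)))                    # same size as rows
print(random_feature_subset(["CBO", "DIT", "LCOM", "NOC"], 2, rng))
print(plurality_vote(["Yes", "No", "Yes"]))                # 'Yes'
```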

4.2 Performance Evaluation Measures

Two performance evaluation measures, accuracy and the area under the ROC curve (AUC), are chosen to assess the predictive performance of the selected algorithms for version-to-version source code file reusability. The accuracy of a model is defined as the ratio of the number of correctly classified Java files to the total number of Java files in the version. The AUC is obtained from the ROC curve, which also indicates the optimal cutoff point, at which both sensitivity and specificity are maximized. AUC values in [0.7, 0.8) indicate acceptable discrimination, values in [0.8, 0.9) indicate excellent discrimination, and values of 0.9 or more indicate outstanding discrimination between the reused and non-reused files by the prediction algorithm.
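Both measures can be sketched directly. The AUC function below uses the rank-based (Mann–Whitney) formulation, which is equivalent to the area under the ROC curve; the example labels and scores are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of files whose reuse label is predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def auc(y_true, scores):
    """Rank-based AUC: the probability that a randomly chosen reused
    file (label 1) receives a higher score than a randomly chosen
    non-reused file (label 0); ties count one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
print(accuracy(y, [1, 0, 0, 0]))  # 0.75
print(auc(y, scores))             # 0.75
```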

5 Empirical Analysis

Results of the models constructed with the meta-classifiers to predict version-to-version file reusability on the four selected versions of JFreeChart are described in this section. The models were built using the WEKA tool, an open-source workbench freely available at http://www.cs.waikato.ac.nz/ml/weka/. The Naïve Bayes [5] classifier was used as the base classifier/learner, and the default settings of the tool were used to construct the seven meta-classification models.

5.1 Model Evaluation Results

The columns in Table 2 show the version-wise accuracy (Acc.) in % and area under the curve (AUC) scored by each of the seven meta-classification algorithms on each of the four versions of JFreeChart. To obtain a more accurate assessment of the predictive ability of the selected classification models, a k-fold cross-validation [22] was conducted for all models generated in this research: the dataset is randomly partitioned into roughly k equal subsets, and in each iteration one of the k subsets is used as the test set while the remaining k−1 subsets form the training set; the process is repeated for all k subsets. The meta-classification results reported in Table 2 were validated with k = 10.
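The cross-validation procedure described above can be sketched as follows (a simplified illustration of the splitting scheme, not WEKA's implementation):

```python
import random

def k_fold_splits(n_items, k=10, seed=0):
    """Randomly partition item indices into k roughly equal folds;
    each fold serves once as the test set while the rest train."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# With 20 files and k = 5, each iteration trains on 16 and tests on 4.
for train, test in k_fold_splits(20, k=5):
    print(len(train), len(test))
```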

Table 2 Results of the meta-classification analysis

From the performance values in Table 2, six of the seven models (all except the Stacking meta-classifier) exhibit good results, with high scores on both performance metrics: accuracy values ranging from 75.8% to 90% and AUC values ranging from 0.64 to 0.88. The AUC results in particular show that these six meta-classification algorithms display acceptable, and in some cases excellent, discrimination between the reused and non-reused files in the four JFreeChart versions, demonstrating their effectiveness for developing sound version-to-version source code reusability prediction models. The Stacking technique achieves the lowest AUC values (< 0.50); even though it achieves high accuracy values (owing solely to the correct classification of the reusable classes), it does not qualify as an effective predictor of version-to-version reusability.

We also applied the LR technique on the four selected datasets, the results of which are stated in Table 3.

Table 3 Binary logistic regression results

The results indicate that the LR technique performs comparably to six of the seven meta-classifiers, especially the Vote meta-classifier, with accuracy values ranging from 73.3% to 83.9% and AUC values ranging from 0.79 to 0.87.

Though there is a difference, however minor, in the prediction performance of the models developed via the selected techniques, we needed to ascertain whether this difference is statistically significant, for which we conducted the Friedman test [23].

Table 4 reports the mean rank scored by each technique under the Friedman test, where the model with the lowest mean rank performs the worst. The test was based on the AUC results obtained by each model on the four selected JFreeChart datasets. According to Table 4, the best-performing technique across the four datasets is the Vote meta-classifier, with a mean rank of 7. It is followed by the LR algorithm with a mean rank of 6, closely followed by the M-Class meta-classifier with a mean rank of 5.25. The Stacking meta-classifier performs the worst, with the lowest mean rank of 1.
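For reference, the Friedman statistic underlying Table 4 can be computed from a datasets-by-techniques AUC table as sketched below. Ties are ignored for simplicity, and the small table in the usage example is hypothetical, not the paper's data.

```python
def friedman_statistic(auc_table):
    """auc_table: rows = datasets, columns = techniques. Ranks each
    row (higher AUC -> higher rank, so the best technique gets the
    highest mean rank, as in Table 4) and returns (mean_ranks,
    chi_square); the statistic has k - 1 degrees of freedom."""
    n, k = len(auc_table), len(auc_table[0])
    rank_sums = [0.0] * k
    for row in auc_table:
        order = sorted(range(k), key=lambda j: row[j])
        for rank_minus_1, j in enumerate(order):
            rank_sums[j] += rank_minus_1 + 1  # rank 1 = worst in the row
    mean_ranks = [s / n for s in rank_sums]
    chi_sq = (12 * n / (k * (k + 1))) * sum(r * r for r in mean_ranks) \
             - 3 * n * (k + 1)
    return mean_ranks, chi_sq

# Hypothetical AUCs: 2 datasets x 3 techniques.
ranks, chi = friedman_statistic([[0.9, 0.5, 0.7],
                                 [0.8, 0.4, 0.6]])
print(ranks, chi)  # [3.0, 1.0, 2.0] 4.0
```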

Table 4 Friedman statistical test results

The Friedman statistic with seven degrees of freedom was 15.421, significant at α = 0.05; the associated p-value of 0.031 shows that the results hold at the 95% confidence level. Therefore, the null hypothesis of the Friedman test, which states that all techniques perform the same, is rejected: the eight techniques (seven meta-classifiers and the LR technique) differ significantly in their performance. The results, however, indicate that only one of the seven meta-classifiers, Vote, performs better than the LR statistical technique, establishing that LR is also effective for predicting version-to-version source code file reusability.

6 Threats to Validity

Like any empirical study, this study faces potential threats to validity. Our work focuses only on estimating the prediction performance of the selected meta-classifiers using statistical and machine learning methods, with the metrics as the independent variables and the Yes/No reuse parameter as the dependent variable. Thus, a threat to internal validity exists, since this work does not attempt to establish cause-effect relationships. The most critical threat to external validity is that our results may not generalize beyond the surveyed systems, i.e., the four selected JFreeChart versions, to similar samples or new research environments. To ascertain the generalizability of the classification inferences made here, the predictive performance of the selected techniques would need to be evaluated on similar datasets built from software written in other programming languages; this threat therefore remains. Construct validity concerns whether the independent and dependent variables properly represent the intended concepts. The metric data was collected via mature source code analysis and clone detection tools; although we make no claims about the accuracy of these tools, we assume they collect the data reliably, as they are employed effectively in practice [24, 25], thereby decreasing the threat to construct validity.

7 Conclusion and Future Work

This research work evaluates seven meta-classification techniques for predicting version-to-version source code file reusability. The empirical validation was performed on four datasets created from four consecutively released versions of JFreeChart. To the best of the authors' knowledge, no study to date has used meta-classifiers for reusability prediction. We further compared the performance of the selected meta-classifiers with the statistical LR technique and ranked the algorithms using the Friedman statistical test.

The chief conclusions from this analysis are as follows:

All the selected meta-classifiers except the Stacking technique showed reasonably good performance (accuracy values ranging from 75.8% to 90% and AUC values from 0.64 to 0.88) over the four selected versions of the JFreeChart software, and did not show extremely divergent outcomes across the four versions. The LR technique also showed performance comparable to the meta-classifiers (accuracy values ranging from 73.3% to 83.9% and AUC values from 0.79 to 0.87) for predicting version-to-version source code reusability over the four selected datasets.

Moreover, the Friedman test showed, at the 95% confidence level, that only one meta-classifier, Vote, significantly outperforms the LR technique; the LR technique ranks second, and the remaining meta-classifiers rank below it. This establishes that the LR technique and six of the seven selected meta-classifiers (AdaBoost, Bagging, Filtered, Multi-class (M-Class), Random Sub Space (RSS), and Voting) are indeed effective for developing sound version-to-version source code reusability prediction models.

Future work may involve replicating the study with the selected meta-classifiers on other similar software datasets to yield more general results. Other prediction models, such as deep learning, could also be applied to establish their suitability for developing software reusability models.