An empirical evaluation of ensemble adjustment methods for analogy-based effort estimation

https://doi.org/10.1016/j.jss.2015.01.028

Highlights

  • Ensembles of adjustment methods are not always superior to single methods.

  • Ensembles of linear methods are more accurate than ensembles of nonlinear methods.

  • Adjustment methods based on GA and NN got the worst accuracy.

  • Changing the value of k makes the prediction models behave diversely.

  • RTM variants are the top-ranked type based on Scott–Knott and two-way ANOVA.

Abstract

Context

Effort adjustment is an essential part of analogy-based effort estimation; it is used to tune and adapt the nearest analogies in order to produce more accurate estimates. Many adjustment methods have been proposed in the literature, but there is no consensus on which method produces more accurate estimates and under which settings.

Objective

This paper investigates the potential of ensemble learning for variants of adjustment methods used in analogy-based effort estimation. The number k of analogies to be used is also investigated.

Method

We perform a large-scale comparison study in which many ensembles, constructed from n out of 40 possible valid variants of adjustment methods, are applied to eight datasets. The performance of each method is evaluated based on standardized accuracy and effect size.

Results

The results have been subjected to statistical significance testing and show reasonable, statistically significant improvements in predictive performance where ensemble methods are applied.

Conclusion

Our conclusions suggest that ensembles of adjustment methods can work well and achieve good performance, even though they are not always superior to single methods. We also recommend constructing ensembles from only linear adjustment methods, as they have shown better performance and were frequently ranked higher.

Introduction

Analogy-based effort estimation (EBA) is a commonly used method for predicting the most likely software development effort (Angelis and Stamelos, 2000, Auer et al., 2006). It is based on the assumption that software projects with similar characteristics have similar effort values (Keung et al., 2008, Kocaguneli et al., 2012, Shepperd and Kadoda, 2001, Mittas et al., 2008). Reusing the effort values of the selected analogies directly, without revision, is less accurate (Azzeh, 2012, Kirsopp et al., 2003). Therefore, an adjustment technique should be applied to calibrate and tune the generated estimate based on the characteristics of both the source and target projects. The goal of adjustment is to minimize the differences between a new project and its nearest analogies, and thereby increase EBA's accuracy.
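To make the mechanism concrete, the following minimal sketch (ours, in Python; the function and variable names are hypothetical, and the simple linear size adjustment shown is only one of the many adjustment strategies examined in this paper) retrieves the k nearest analogies by Euclidean distance over normalised features and adjusts their efforts by the ratio between the target's size and each analogy's size:

    import numpy as np

    def eba_estimate(target_features, target_size, features, sizes, efforts, k=3):
        # Retrieve the k most similar historical projects (nearest analogies)
        # using Euclidean distance over the (normalised) feature vectors.
        distances = np.linalg.norm(features - target_features, axis=1)
        analogies = np.argsort(distances)[:k]
        # Linear size adjustment: scale each analogy's effort by the ratio of
        # the target's size to the analogy's size, then average the results.
        adjusted = efforts[analogies] * (target_size / sizes[analogies])
        return adjusted.mean()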

Many adjustment methods have been proposed in the past 20 years (Azzeh, 2012), but there is as yet no unequivocal conclusion as to which adjustment method, integrated with EBA, produces the most accurate predictions and under which settings. However, Azzeh's (2012) replication study reported an important insight. He showed that, even though no particular method is significantly superior to the others, guidelines can be given to explain how and under what conditions to use each of the existing methods. It was concluded that each method favors: (1) a different feature set, (2) a different number of nearest analogies (k), and (3) a specific type of features (i.e., continuous or categorical). Moreover, the results from that study showed that some adjustment methods cannot outperform conventional EBA on some datasets. For these reasons, it was difficult to recommend a particular method over the others for a particular dataset. We believe that it would be more promising to combine existing methods in order to benefit from their individual advantages (and consequently improve the accuracy of adjusted EBA) than to create a new adjustment method.

The literature on predictive methods for software effort estimation has shown that combining several predictive models into an ensemble can produce more accurate results than single models (Kocaguneli et al., 2012). Prior work on ensemble methods in the area of data mining also reports that ensembles can produce results comparable, if not superior, to single models (Seni and Elder, 2010, Hastie et al., 2008, Kohavi, 1995). The idea behind the success of ensembles is that the accurate predictions given by some of its models for a given example can patch the mistakes made by others on that example (Kocaguneli et al., 2012). In this way, the overall accuracy of the ensemble can be better than the individual accuracies of its base models. In order to achieve that, it is well accepted that the base models composing the ensemble should be diverse, i.e., they should make different mistakes on the same data points (Minku and Yao, 2013, Chandra and Yao, 2006). If they make the same mistakes, then the ensemble will also make the same mistakes as the individual models, and its performance will be no better than the individual performances. In other words, ensembles of non-diverse models are unsuccessful in improving the accuracy of these models.
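A small numeric illustration of this point (our own, not taken from the study itself): when two base models err in opposite directions on the same project, averaging their estimates cancels part of the error, whereas two models that make the same mistake gain almost nothing from being combined.

    actual = 100.0                             # true effort of a project

    diverse = [120.0, 85.0]                    # errors in opposite directions
    print(abs(sum(diverse) / 2 - actual))      # ensemble error: 2.5

    non_diverse = [120.0, 118.0]               # both models overestimate
    print(abs(sum(non_diverse) / 2 - actual))  # ensemble error: 19.0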

Even though ensembles of software effort estimation models have been increasingly studied in software engineering, this is the first study that attempts to combine adjustment methods into ensembles. It is not known whether ensembles of adjustment methods would be successful in improving the accuracy of the calibration of EBA, and consequently the accuracy of EBA itself. In particular, it is not known whether different adjustment techniques behave diversely enough, i.e., if their amount of diversity is enough to lead to improvements in performance. If they do not, then combining these different techniques into an ensemble may not really improve performance. The main objective of this study is thus to investigate the potential of ensembles of adjustment methods for EBA.

With that in mind, this study aims at answering the following research questions:

  • RQ1.

    Is there evidence that ensembles improve the accuracy of adjusted EBA?

  • RQ2.

    Which approach is better for adjustment, linear or non-linear methods?

  • RQ3.

    Is there evidence that using different k analogies makes adjustment methods behave diversely?

The main contributions of this paper are the following:

  • (1)

    An evaluation of each adjusted EBA variant over all datasets to identify the ones that are actual prediction methods, based on the standardized accuracy (SA) measure and effect size.

  • (2)

    Ranking and clustering of actual prediction methods using Scott–Knott to identify the best methods with smallest mean absolute error.

  • (3)

    A new approach to build ensembles of adjustment methods based on the Scott–Knott test and the Borda count procedure (see the sketch after this list). This approach works well when all of the best methods identified by Scott–Knott are statistically similar. Existing methods such as win-tie-loss (Kocaguneli et al., 2012) do not work well in this case, because their ranking mechanism depends on significance tests between the different methods.

  • (4)

    An evaluation of ensembles of adjustment methods against single adjustment methods using SA, effect size and other ranking methods, to determine whether ensembles are successful in improving performance of single adjustment methods.
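Regarding contribution (3), the exact procedure is described in Section 3; the sketch below (ours, with hypothetical method names and a simplified scoring rule) only illustrates the Borda count step, in which each method earns, per dataset, one point for every method ranked below it, and the points are summed across datasets:

    # Hypothetical per-dataset rankings of four candidate methods (best first).
    rankings = [
        ["A", "B", "C", "D"],   # dataset 1
        ["B", "A", "C", "D"],   # dataset 2
        ["A", "C", "B", "D"],   # dataset 3
    ]

    def borda_count(rankings):
        scores = {}
        for ranking in rankings:
            n = len(ranking)
            for position, method in enumerate(ranking):
                # A method ranked at `position` gets one point per method below it.
                scores[method] = scores.get(method, 0) + (n - 1 - position)
        return sorted(scores.items(), key=lambda item: -item[1])

    print(borda_count(rankings))   # [('A', 8), ('B', 6), ('C', 4), ('D', 0)]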

In summary, this study is the first work to investigate ensembles of adjustment methods and the first work to create ensembles using the Scott–Knott test and the Borda count procedure. The remainder of the paper is structured as follows: Section 2 presents an overview of ensemble methods, as well as the related work on adjustment methods and ensembles in software effort estimation. Section 3 describes the methodology conducted in this research. Section 4 shows the obtained results, which are discussed in Section 5. Section 6 presents threats to the validity of our study. Finally, Section 7 presents our conclusions.

Section snippets

Ensembles in software effort estimation

Ensembles are learning methods that combine single (aka base) predictive models through a particular aggregation mechanism. The prediction given by the ensemble is a combination of the predictions given by each of its base models, e.g., a weighted average (Seni and Elder, 2010). The principal idea of ensembles is that if their models are accurate and diverse, then their performance will be better than that of their base models. Two models are said to be diverse if they make different errors on

Forty variants of adjustment methods

The methods investigated in this study are a collection of linear and nonlinear adjustment methods. They were selected because their use has previously been examined in the area of effort estimation.

Constructing ensembles favors using different methods that fail under different circumstances (Ghosh, 2002, Kittler et al., 1998, Alpaydin, 1998). Specifically, ensemble methods perform better when some members of the ensemble correct the errors made by other members. Each adjustment method

Results

This section presents the results of the experiments conducted on 8 datasets and 40 adjustment methods with the aim of providing a better understanding of the relationship between datasets, adjustment methods and number of nearest analogies. In the first section, we evaluate the validity of these adjustment methods and their ability to provide actual predictions. Then, we evaluate the constructed ensemble methods against single methods.
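For reference, standardized accuracy and effect size are commonly defined against a random-guessing baseline; we assume that standard formulation in the minimal sketch below (it is an illustration, not necessarily the exact variant computed in this study). A method is only treated as an actual prediction method when it clearly outperforms random guessing.

    import numpy as np

    def standardized_accuracy(abs_residuals, abs_residuals_p0):
        # SA = (1 - MAR / MAR_p0) * 100, where MAR_p0 is the mean absolute
        # residual obtained by random guessing (predicting with another
        # project's known effort). SA > 0 means the method beats guessing.
        return (1 - np.mean(abs_residuals) / np.mean(abs_residuals_p0)) * 100

    def effect_size(abs_residuals, abs_residuals_p0):
        # Standardised difference of residuals against the guessing baseline.
        return ((np.mean(abs_residuals) - np.mean(abs_residuals_p0))
                / np.std(abs_residuals_p0, ddof=1))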

Discussion

An ensemble is a machine learning method that leverages multiple methods to obtain better accuracy than any single method can achieve. The primary goal when building ensembles is the same as when establishing a committee of members, where each method can patch mistakes made by other methods in that ensemble. In a committee, members compete among themselves, but at the same time they are complementary to each other. This means that if a member's decision is not right, other members can notice

Threats to validity

This section describes threats to the validity of this research with respect to internal and external validity. The main internal validity question is: is the variation in the dependent variable due to changes in the independent variable? To address this issue, we used eight datasets and applied leave-one-out cross-validation in our experiments so that, for each iteration, we used a different test instance and a different training set. The main advantage of the leave-one-out method is that it can
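As a concrete illustration of this validation procedure (a minimal sketch of our own, reusing the estimator signature from the earlier example rather than the authors' implementation), leave-one-out cross-validation holds out one project at a time and uses all remaining projects as the case base:

    import numpy as np

    def loocv_absolute_residuals(features, sizes, efforts, estimate_fn, k=3):
        # Each project is the test instance exactly once; the remaining
        # projects form the training (case base) set for that iteration.
        residuals = []
        for i in range(len(efforts)):
            train = np.arange(len(efforts)) != i
            prediction = estimate_fn(features[i], sizes[i],
                                     features[train], sizes[train],
                                     efforts[train], k)
            residuals.append(abs(prediction - efforts[i]))
        return np.array(residuals)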

Conclusions

In this paper, we studied eight adjustment methods existing in the literature. We conducted several experiments on 40 variants of single adjustment methods, using four performance measures and eight historical datasets, to investigate the accuracy of ensembles of adjustment methods. Our results reveal that ensembles of adjustment methods relatively improve prediction accuracy compared to single adjustment methods used in EBA. Therefore, we conclude that, as it is always hard to identify the best

Acknowledgments

Mohammad Azzeh and Ali Bou Nassif are grateful to the Applied Science University, Amman, Jordan, for the financial support granted to carry out this research. Leandro Minku is grateful to EPSRC for the financial support given through the grant no. EP/J017515/1.


References (60)

  • E. Alpaydin, Techniques for combining multiple learners.

  • L. Angelis et al., A simulation tool for efficient analogy based cost estimation, J. Empir. Softw. Eng. (2000).

  • M. Auer et al., Optimal project feature weights in analogy-based cost estimation: Improvement and limitations, IEEE Trans. Softw. Eng. (2006).

  • D. Azhar et al., Using ensembles for web effort estimation.

  • M. Azzeh, Model tree based adaptation strategy for software effort estimation by analogy.

  • M. Azzeh, A replicated assessment and comparison of adaptation techniques for analogy-based effort estimation, J. Empir. Softw. Eng. (2012).

  • M. Azzeh et al., Learning best K analogies from data distribution for case-based software effort estimation.

  • E. Bauer et al., An empirical comparison of voting classification algorithms: Bagging, boosting, and variants, J. Mach. Learn. (1999).

  • B. Boehm, Software Engineering Economics (1981).

  • P.L. Braga et al., Bagging predictors for estimation of software project effort.

  • A. Chandra et al., Ensemble learning using multi-objective evolutionary algorithms, J. Math. Model. Algorithm (2006).

  • K. Dejaeger et al., Data mining techniques for software effort estimation: A comparative study, IEEE Trans. Softw. Eng. (2012).

  • T. Foss et al., A simulation study of the model evaluation criterion MMRE, IEEE Trans. Softw. Eng. (2003).

  • J. Ghosh, Multiclassifier systems: Back to the future.

  • T. Hastie et al., The Elements of Statistical Learning: Data Mining, Inference and Prediction (2008).

  • G. Kadoda et al., Experiences using case based reasoning to predict software project effort.

  • J. Keung et al., Analogy-X: Providing statistical inference to analogy-based software cost estimation, IEEE Trans. Softw. Eng. (2008).

  • M. Khoshgoftaar et al., Software quality analysis by combining multiple projects and learners, J. Softw. Qual. Contr. (2009).

  • T.M. Khoshgoftaar et al., Enhancing software quality estimation using ensemble-classifier based noise filtering, Intell. Data Anal. (2005).

Mohammad Y. Azzeh is an assistant professor of Software Engineering at Applied Science University. He holds a Ph.D. in Computing from the University of Bradford, UK, and an M.Sc. in Software Engineering from the University of the West of England, UK. He was a software developer at Motorola UK in 2002. He is currently a faculty member in the Software Engineering Department at Applied Science University. His research interests include software cost estimation, software project management, search-based software engineering and applications of machine learning algorithms to software engineering problems. He is an invited referee for high quality journals and a PC member of international conferences. He is also a member of the Association of Jordanian Engineers.

Ali Bou Nassif is currently an adjunct professor at King's University College, as well as a post-doctoral fellow at Western University, Canada. He obtained a Master's degree in Computer Science and a Ph.D. degree in Electrical and Computer Engineering from Western University in 2009 and 2012, respectively. Prior to joining Western, he worked in the IT field and provided IT services including, but not limited to, IT sales and consulting for several years. He has also taught many courses in Computer Science at the undergraduate level. His research areas include software effort estimation, requirements engineering, cloud computing and service-oriented architecture.

Leandro L. Minku is a research fellow at the Centre of Excellence for Research in Computational Intelligence and Applications (CERCIA), School of Computer Science, the University of Birmingham, UK. He received the B.Sc., M.Sc. and Ph.D. degrees in Computer Science from the Federal University of Parana, Brazil, in 2003, the Federal University of Pernambuco, Brazil, in 2006, and the University of Birmingham, UK, in 2011, respectively. He was an intern at Google Zurich for six months in 2009/2010, and the recipient of the Overseas Research Students Award (ORSAS) from the British government and of several scholarships from the Brazilian Council for Scientific and Technological Development (CNPq). His main research interests include search-based software engineering, software prediction models, machine learning in changing environments, and ensembles of learning machines. His work has been published in internationally renowned journals such as ACM Transactions on Software Engineering and Methodology, IEEE Transactions on Knowledge and Data Engineering, and Neural Networks.
