An empirical evaluation of ensemble adjustment methods for analogy-based effort estimation

https://doi.org/10.1016/j.jss.2015.01.028

Highlights

  • Ensembles of adjustment methods are not always superior to single methods.

  • Ensembles of linear methods are more accurate than ensembles of nonlinear methods.

  • Adjustment methods based on GA and NN got the worst accuracy.

  • Changing the value of k makes the prediction models behave diversely.

  • RTM variants are the top-ranked type based on Scott–Knott and two-way ANOVA.

Abstract

Context

Effort adjustment is an essential part of analogy-based effort estimation; it is used to tune and adapt the nearest analogies in order to produce more accurate estimates. Many adjustment methods have been proposed in the literature, but there is no consensus on which method produces more accurate estimates and under which settings.

Objective

This paper investigates the potential of ensemble learning for variants of adjustment methods used in analogy-based effort estimation. The number k of analogies to be used is also investigated.

Method

We perform a large-scale comparison study in which many ensembles, constructed from n out of 40 possible valid variants of adjustment methods, are applied to eight datasets. The performance of each method is evaluated based on standardized accuracy and effect size.

Results

The results have been subjected to statistical significance testing and show reasonable, statistically significant improvements in predictive performance where ensemble methods are applied.

Conclusion

Our conclusions suggest that ensembles of adjustment methods can work well and achieve good performance, even though they are not always superior to single methods. We also recommend constructing ensembles from only linear adjustment methods, as they have shown better performance and were frequently ranked higher.

Introduction

Analogy-based effort estimation (EBA) is a commonly used method for predicting the most likely software development effort (Angelis and Stamelos, 2000, Auer et al., 2006). It is based on the assumption that software projects with similar characteristics have similar effort values (Keung et al., 2008, Kocaguneli et al., 2012, Shepperd and Kadoda, 2001, Mittas et al., 2008). Reusing the effort values of the selected analogies directly, without revision, is less accurate (Azzeh, 2012, Kirsopp et al., 2003). Therefore, an adjustment technique should be applied to calibrate and tune the generated estimate based on the characteristics of both the source and target projects. The goal of adjustment is to minimize the differences between a new project and its nearest analogies, and thereby increase EBA's accuracy.
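To make the mechanism concrete, the following minimal sketch (ours, in Python; the function and variable names are hypothetical, and the simple linear size adjustment shown is only one of the many adjustment strategies examined in this paper) retrieves the k nearest analogies by Euclidean distance over normalised features and adjusts their efforts by the ratio between the target's size and each analogy's size:

    import numpy as np

    def eba_estimate(target_features, target_size, features, sizes, efforts, k=3):
        # Retrieve the k most similar historical projects (nearest analogies)
        # using Euclidean distance over the (normalised) feature vectors.
        distances = np.linalg.norm(features - target_features, axis=1)
        analogies = np.argsort(distances)[:k]
        # Linear size adjustment: scale each analogy's effort by the ratio of
        # the target's size to the analogy's size, then average the results.
        adjusted = efforts[analogies] * (target_size / sizes[analogies])
        return adjusted.mean()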

Many adjustment methods have been proposed in the past 20 years (Azzeh, 2012), but there is as yet no unequivocal conclusion as to which adjustment method, integrated with EBA, produces the most accurate predictions and under which settings. However, Azzeh's (2012) replication study reported an important insight. He showed that, even though no particular method is significantly superior to the others, guidelines can be given to explain how and under what conditions to use each of the existing methods. It was concluded that each method favors: (1) a different feature set, (2) a different number of nearest analogies (k), and (3) a specific type of features (i.e., continuous or categorical). Moreover, the results from that study showed that some adjustment methods cannot outperform conventional EBA on some datasets. For these reasons, it was difficult to recommend a particular method over the others for a particular dataset. We believe that it would be more promising to combine existing methods in order to benefit from their individual advantages (and consequently improve the accuracy of adjusted EBA) than to create a new adjustment method.

The literature on predictive methods for software effort estimation has shown that combining several predictive models into an ensemble can produce more accurate results than single models (Kocaguneli et al., 2012). Prior work on ensemble methods in the area of data mining also reports that ensembles can produce results comparable, if not superior, to single models (Seni and Elder, 2010, Hastie et al., 2008, Kohavi, 1995). The idea behind the success of ensembles is that the accurate predictions given by some of its models for a given example can patch the mistakes made by others on that example (Kocaguneli et al., 2012). In this way, the overall accuracy of the ensemble can be better than the individual accuracies of its base models. In order to achieve that, it is well accepted that the base models composing the ensemble should be diverse, i.e., they should make different mistakes on the same data points (Minku and Yao, 2013, Chandra and Yao, 2006). If they make the same mistakes, then the ensemble will also make the same mistakes as the individual models, and its performance will be no better than the individual performances. In other words, ensembles of non-diverse models are unsuccessful in improving the accuracy of these models.
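A small numeric illustration of this point (our own, not taken from the study itself): when two base models err in opposite directions on the same project, averaging their estimates cancels part of the error, whereas two models that make the same mistake gain almost nothing from being combined.

    actual = 100.0                             # true effort of a project

    diverse = [120.0, 85.0]                    # errors in opposite directions
    print(abs(sum(diverse) / 2 - actual))      # ensemble error: 2.5

    non_diverse = [120.0, 118.0]               # both models overestimate
    print(abs(sum(non_diverse) / 2 - actual))  # ensemble error: 19.0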

Even though ensembles of software effort estimation models have been increasingly studied in software engineering, this is the first study that attempts to combine adjustment methods into ensembles. It is not known whether ensembles of adjustment methods would be successful in improving the accuracy of the calibration of EBA, and consequently the accuracy of EBA itself. In particular, it is not known whether different adjustment techniques behave diversely enough, i.e., if their amount of diversity is enough to lead to improvements in performance. If they do not, then combining these different techniques into an ensemble may not really improve performance. The main objective of this study is thus to investigate the potential of ensembles of adjustment methods for EBA.

With that in mind, this study aims at answering the following research questions:

  • RQ1.

    Is there evidence that ensembles improve the accuracy of adjusted EBA?

  • RQ2.

    Which approach is better for adjustment, linear or non-linear methods?

  • RQ3.

    Is there evidence that using different k analogies makes adjustment methods behave diversely?

The main contributions of this paper are the following:

  • (1)

    An evaluation of each adjusted EBA variant over all datasets to identify the ones that are actual prediction methods, based on the standardized accuracy (SA) measure and effect size.

  • (2)

    Ranking and clustering of actual prediction methods using Scott–Knott to identify the best methods with smallest mean absolute error.

  • (3)

    A new approach to build ensembles of adjustment methods based on the Scott–Knott test and the Borda count procedure (see the sketch after this list). This approach works well when all of the best methods identified by Scott–Knott are statistically similar. Existing methods such as win-tie-loss (Kocaguneli et al., 2012) do not work well in this case, because their ranking mechanism depends on significance tests between the different methods.

  • (4)

    An evaluation of ensembles of adjustment methods against single adjustment methods using SA, effect size and other ranking methods, to determine whether ensembles are successful in improving performance of single adjustment methods.
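Regarding contribution (3), the exact procedure is described in Section 3; the sketch below (ours, with hypothetical method names and a simplified scoring rule) only illustrates the Borda count step, in which each method earns, per dataset, one point for every method ranked below it, and the points are summed across datasets:

    # Hypothetical per-dataset rankings of four candidate methods (best first).
    rankings = [
        ["A", "B", "C", "D"],   # dataset 1
        ["B", "A", "C", "D"],   # dataset 2
        ["A", "C", "B", "D"],   # dataset 3
    ]

    def borda_count(rankings):
        scores = {}
        for ranking in rankings:
            n = len(ranking)
            for position, method in enumerate(ranking):
                # A method ranked at `position` gets one point per method below it.
                scores[method] = scores.get(method, 0) + (n - 1 - position)
        return sorted(scores.items(), key=lambda item: -item[1])

    print(borda_count(rankings))   # [('A', 8), ('B', 6), ('C', 4), ('D', 0)]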

In summary, this study is the first work to investigate ensembles of adjustment methods and the first work to create ensembles using the Scott–Knott test and the Borda count procedure. The remainder of the paper is structured as follows: Section 2 presents an overview of ensemble methods, as well as the related work on adjustment methods and ensembles in software effort estimation. Section 3 describes the methodology conducted in this research. Section 4 shows the obtained results, which are discussed in Section 5. Section 6 presents threats to the validity of our study. Finally, Section 7 presents our conclusions.

Section snippets

Ensembles in software effort estimation

Ensembles are learning methods that combine single (aka base) predictive models through a particular aggregation mechanism. The prediction given by the ensemble is a combination of the predictions given by each of its base models, e.g., a weighted average (Seni and Elder, 2010). The principal idea of ensembles is that if their models are accurate and diverse, then their performance will be better than that of their base models. Two models are said to be diverse if they make different errors on

Forty variants of adjustment methods

The methods investigated in this study are a collection of linear and nonlinear adjustment methods. They were selected because their use has previously been examined in the area of effort estimation.

Constructing ensembles favors using different methods that fail under different circumstances (Ghosh, 2002, Kittler et al., 1998, Alpaydin, 1998). Specifically, ensemble methods perform better when some members of the ensemble correct the errors made by other members. Each adjustment method

Results

This section presents the results of the experiments conducted on 8 datasets and 40 adjustment methods with the aim of providing a better understanding of the relationship between datasets, adjustment methods and number of nearest analogies. In the first section, we evaluate the validity of these adjustment methods and their ability to provide actual predictions. Then, we evaluate the constructed ensemble methods against single methods.
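For reference, standardized accuracy and effect size are commonly defined against a random-guessing baseline; we assume that standard formulation in the minimal sketch below (it is an illustration, not necessarily the exact variant computed in this study). A method is only treated as an actual prediction method when it clearly outperforms random guessing.

    import numpy as np

    def standardized_accuracy(abs_residuals, abs_residuals_p0):
        # SA = (1 - MAR / MAR_p0) * 100, where MAR_p0 is the mean absolute
        # residual obtained by random guessing (predicting with another
        # project's known effort). SA > 0 means the method beats guessing.
        return (1 - np.mean(abs_residuals) / np.mean(abs_residuals_p0)) * 100

    def effect_size(abs_residuals, abs_residuals_p0):
        # Standardised difference of residuals against the guessing baseline.
        return ((np.mean(abs_residuals) - np.mean(abs_residuals_p0))
                / np.std(abs_residuals_p0, ddof=1))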

Discussion

An ensemble is a machine learning method that leverages multiple methods to obtain better accuracy than any single method can achieve. The primary goal when building ensembles is the same as when establishing a committee of members, where each method can patch mistakes made by other methods in that ensemble. In a committee, members compete among themselves, but at the same time they are complementary to each other. This means that if a member's decision is not right, other members can notice

Threats to validity

This section describes threats to the validity of this research with respect to internal and external validity. The main internal validity question is: is the variation in the dependent variable due to changes in the independent variable? To address this issue, we used eight datasets and applied leave-one-out cross-validation in our experiments so that, for each iteration, we used a different test instance and a different training set. The main advantage of the leave-one-out method is that it can
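As a concrete illustration of this validation procedure (a minimal sketch of our own, reusing the estimator signature from the earlier example rather than the authors' implementation), leave-one-out cross-validation holds out one project at a time and uses all remaining projects as the case base:

    import numpy as np

    def loocv_absolute_residuals(features, sizes, efforts, estimate_fn, k=3):
        # Each project is the test instance exactly once; the remaining
        # projects form the training (case base) set for that iteration.
        residuals = []
        for i in range(len(efforts)):
            train = np.arange(len(efforts)) != i
            prediction = estimate_fn(features[i], sizes[i],
                                     features[train], sizes[train],
                                     efforts[train], k)
            residuals.append(abs(prediction - efforts[i]))
        return np.array(residuals)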

Conclusions

In this paper, we studied eight adjustment methods existing in the literature. We conducted several experiments on 40 variants of single adjustment methods, using four performance measures and eight historical datasets, to investigate the accuracy of ensembles of adjustment methods. Our results reveal that ensembles of adjustment methods relatively improve prediction accuracy compared to single adjustment methods used in EBA. Therefore, we conclude that, as it is always hard to identify the best

Acknowledgments

Mohammad Azzeh and Ali Bou Nassif are grateful to the Applied Science University, Amman, Jordan, for the financial support granted to carry out this research. Leandro Minku is grateful to EPSRC for the financial support given through the grant no. EP/J017515/1.


References (60)

  • E. Alpaydin, Techniques for combining multiple learners.

  • L. Angelis et al., A simulation tool for efficient analogy based cost estimation, J. Empir. Softw. Eng. (2000).

  • M. Auer et al., Optimal project feature weights in analogy-based cost estimation: Improvement and limitations, IEEE Trans. Softw. Eng. (2006).

  • D. Azhar et al., Using ensembles for web effort estimation.

  • M. Azzeh, Model tree based adaptation strategy for software effort estimation by analogy.

  • M. Azzeh, A replicated assessment and comparison of adaptation techniques for analogy-based effort estimation, J. Empir. Softw. Eng. (2012).

  • M. Azzeh et al., Learning best K analogies from data distribution for case-based software effort estimation.

  • E. Bauer et al., An empirical comparison of voting classification algorithms: Bagging, boosting, and variants, J. Mach. Learn. (1999).

  • B. Boehm, Software Engineering Economics (1981).

  • P.L. Braga et al., Bagging predictors for estimation of software project effort.

  • A. Chandra et al., Ensemble learning using multi-objective evolutionary algorithms, J. Math. Model. Algorithm (2006).

  • K. Dejaeger et al., Data mining techniques for software effort estimation: A comparative study, IEEE Trans. Softw. Eng. (2012).

  • T. Foss et al., A simulation study of the model evaluation criterion MMRE, IEEE Trans. Softw. Eng. (2003).

  • J. Ghosh, Multiclassifier systems: Back to the future.

  • T. Hastie et al., The Elements of Statistical Learning: Data Mining, Inference and Prediction (2008).

  • G. Kadoda et al., Experiences using case based reasoning to predict software project effort.

  • J. Keung et al., Analogy-X: Providing statistical inference to analogy-based software cost estimation, IEEE Trans. Softw. Eng. (2008).

  • M. Khoshgoftaar et al., Software quality analysis by combining multiple projects and learners, J. Softw. Qual. Contr. (2009).

  • T.M. Khoshgoftaar et al., Enhancing software quality estimation using ensemble-classifier based noise filtering, Intell. Data Anal. (2005).

Mohammad Y. Azzeh is an assistant professor of Software Engineering at Applied Science University. He holds a Ph.D. in Computing from the University of Bradford, UK, and an M.Sc. in Software Engineering from the University of the West of England, UK. He was a software developer at Motorola UK in 2002. He is currently a faculty member in the Software Engineering Department at Applied Science University. His research interests include software cost estimation, software project management, search-based software engineering and applications of machine learning algorithms to software engineering problems. He is an invited referee for high quality journals and a PC member of international conferences. He is also a member of the Association of Jordanian Engineers.

Ali Bou Nassif is currently an adjunct professor at King's University College, as well as a post-doctoral fellow at Western University, Canada. He obtained a Master's degree in Computer Science and a Ph.D. degree in Electrical and Computer Engineering from Western University in 2009 and 2012, respectively. Prior to joining Western, he worked in the IT field and provided IT services including, but not limited to, IT sales and consulting for several years. He has also taught many courses in Computer Science at the undergraduate level. His research areas include software effort estimation, requirements engineering, cloud computing and service-oriented architecture.

Leandro L. Minku is a research fellow at the Centre of Excellence for Research in Computational Intelligence and Applications (CERCIA), School of Computer Science, the University of Birmingham, UK. He received the B.Sc., M.Sc. and Ph.D. degrees in Computer Science from the Federal University of Parana, Brazil, in 2003, the Federal University of Pernambuco, Brazil, in 2006, and the University of Birmingham, UK, in 2011, respectively. He was an intern at Google Zurich for six months in 2009/2010, and the recipient of the Overseas Research Students Award (ORSAS) from the British government and of several scholarships from the Brazilian Council for Scientific and Technological Development (CNPq). His main research interests include search-based software engineering, software prediction models, machine learning in changing environments, and ensembles of learning machines. His work has been published in internationally renowned journals such as ACM Transactions on Software Engineering and Methodology, IEEE Transactions on Knowledge and Data Engineering, and Neural Networks.
