Cross-version defect prediction: use historical data, cross-project data, or both?

Amasaki, Sousuke

doi:10.1007/s10664-019-09777-8

Published: 28 January 2020

Volume 25, pages 1573–1595, (2020)

Empirical Software Engineering

Sousuke Amasaki ORCID: orcid.org/0000-0001-8763-3457¹

Abstract

Context

Although a long-running project goes through many releases, removing defects from the product remains a challenge. Cross-version defect prediction (CVDP) treats project data from prior releases as a useful source for predicting fault-prone modules with defect prediction techniques. Recent studies have explored cross-project defect prediction (CPDP), which uses project data from outside a project for defect prediction. Although CPDP techniques and CPDP data can be applied to CVDP, their effectiveness in that setting has not been investigated.

Objective

To investigate whether CPDP approaches and CPDP data are useful for CVDP, and to compare different ways of using prior-release data.

Method

The study was designed as a replication of a previous comparative study on CPDP approaches.

Results

Some CPDP approaches could improve the performance of CVDP. Using the latest prior release was the best choice. When no CVDP data were available, using CPDP data for CVDP was found to be effective.

Conclusions

1) Some CPDP approaches could improve CVDP; 2) if one can access project data from the latest release, project data from older releases would not bring a clear benefit; and 3) even if one has no CVDP data, appropriate CPDP approaches would be able to deliver quality prediction with CPDP data.
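
The training-data options contrasted in the abstract can be pictured with a minimal sketch. It is not the author's tooling: the record layout (a metric vector plus a defect label), the function name candidate_training_sets, and the option names are all hypothetical, and reading "both" as all prior releases plus cross-project data is an assumption made only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical module record: (metric vector, defective?).
Module = Tuple[List[float], bool]
Release = List[Module]


def candidate_training_sets(
    prior_releases: List[Release],
    cross_project: List[Release],
) -> Dict[str, List[Module]]:
    """Assemble candidate training sets for predicting the next release.

    prior_releases is ordered oldest to newest and excludes the target
    release; cross_project holds releases of other projects.
    """
    options: Dict[str, List[Module]] = {}
    if prior_releases:
        # Historical data: only the latest release, or every prior release.
        options["latest_prior_release"] = list(prior_releases[-1])
        options["all_prior_releases"] = [m for r in prior_releases for m in r]
    cpdp = [m for r in cross_project for m in r]
    if cpdp:
        # Cross-project data alone, usable even with no project history.
        options["cross_project_only"] = cpdp
    if prior_releases and cpdp:
        # Assumed reading of "both": history merged with cross-project data.
        options["prior_plus_cross_project"] = (
            options["all_prior_releases"] + cpdp
        )
    return options


if __name__ == "__main__":
    v1 = [([10.0, 2.0], False), ([55.0, 9.0], True)]
    v2 = [([12.0, 2.0], False)]
    other = [[([40.0, 7.0], True)]]
    for name, data in candidate_training_sets([v1, v2], other).items():
        print(name, len(data))
```

Each returned list would then be passed to whichever classifier or CPDP approach is under comparison, with the target release held out for evaluation.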

Notes

http://www.spinelis.gr/sw/ckjm

Acknowledgments

This work was partially supported by JSPS KAKENHI under Grant No. 18K11246.

Author information

Authors and Affiliations

Okayama Prefectural University, 111 Kuboki, Soja, 719-1197, Japan
Sousuke Amasaki

Corresponding author

Correspondence to Sousuke Amasaki.

Additional information

Communicated by: Shane McIntosh, Leandro L. Minku, Ayşe Tosun, and Burak Turhan

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Predictive Models and Data Analytics in Software Engineering (PROMISE)

About this article

Cite this article

Amasaki, S. Cross-version defect prediction: use historical data, cross-project data, or both?. Empir Software Eng 25, 1573–1595 (2020). https://doi.org/10.1007/s10664-019-09777-8

Issue Date: March 2020
