Theory of relative defect proneness

Koru, A. Güneş; Emam, Khaled El; Zhang, Dongsong; Liu, Hongfang; Mathew, Divya

doi:10.1007/s10664-008-9080-x

Replicated studies on the functional form of the size-defect relationship

Published: 05 September 2008

Volume 13, pages 473–498 (2008)

Empirical Software Engineering

A. Güneş Koru¹, Khaled El Emam²,³, Dongsong Zhang¹, Hongfang Liu⁴ & Divya Mathew¹

Abstract

In this study, we investigated the functional form of the size-defect relationship for software modules through replicated studies conducted on ten open-source products. We consistently observed a power-law relationship where defect proneness increases at a slower rate compared to size. Therefore, smaller modules are proportionally more defect prone. We externally validated the application of our results for two commercial systems. Given limited and fixed resources for code inspections, there would be an impressive improvement in the cost-effectiveness, as much as 341% in one of the systems, if a smallest-first strategy were preferred over a largest-first one. The consistent results obtained in this study led us to state a theory of relative defect proneness (RDP): In large-scale software systems, smaller modules will be proportionally more defect-prone compared to larger ones. We suggest that practitioners consider our results and give higher priority to smaller modules in their focused quality assurance efforts.

Notes

Webcite link: http://www.webcitation.org/5RqqbCKKm (cached Sep. 14, 2007)
Webcite link: http://www.webcitation.org/5Rqr0BSz8 (cached Sep. 14, 2007)
CVS was the source code control system used by the KOffice developers. Webcite link: http://www.webcitation.org/5RrT2BaV1 (cached Sep. 14, 2007)
Perl is a stable, cross platform programming language. Webcite link: http://www.webcitation.org/5RrTDEdYV (cached Sep. 14, 2007)

Acknowledgements

We would like to thank Frank E. Harrell for extending and modifying some of the functionality in his Design package for us, Victor R. Basili for his helpful comments, Jeff Tian for providing data, the associate editor, Tim Menzies, for his guidance and suggestions, and the anonymous reviewers of this paper for their helpful and constructive feedback.

Author information

Authors and Affiliations

Department of Information Systems, UMBC, 1000 Hilltop Circle, Baltimore, MD, 21250, USA
A. Güneş Koru, Dongsong Zhang & Divya Mathew
Children's Hospital of Eastern Ontario, CHEO Research Institute, E-Health Information Laboratory, 401 Smyth Road, Ottawa, Ontario, K1H 8L1, Canada
Khaled El Emam
Faculty of Medicine and School of Information Technology, University of Ottawa, Ottawa, Ontario, Canada
Khaled El Emam
Department of Biostatistics, Bioinformatics, and Biomathematics, School of Medicine, Georgetown University, Suite 180, Building D, 4000 Reservoir Rd, NW, Washington, DC, 20057-1484, USA
Hongfang Liu

Corresponding author

Correspondence to A. Güneş Koru.

Appendix

In this appendix, we explain how to calculate the RDP of the modules chosen by one inspection strategy with respect to those chosen by another inspection strategy. The first inspection strategy chooses $m$ modules having sizes (in LOC) $s_1, s_2, \ldots, s_m$, and the second one chooses $n$ modules having sizes $S_1, S_2, \ldots, S_n$.

First, let us take a reference module with size $C$. Since we observed a logarithmic shape for the link function, following (2) and omitting the time parameter $t$ to simplify the notation, the RDP of an individual module of size $s$ with respect to this reference module at any time $t$ would be $e^{\beta(\ln s - \ln C)}$. For each inspection strategy, we calculate the sum of the RDP of the selected individual modules with respect to the reference module. To find the RDP of the first strategy's modules with respect to the second's, we simply take the ratio of these sums:

\[
\mathrm{RDP} = \frac{\sum_{i=1}^{m} e^{\beta(\ln s_i - \ln C)}}{\sum_{j=1}^{n} e^{\beta(\ln S_j - \ln C)}} \tag{8}
\]
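The ratio of sums described in this appendix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the module sizes, and the value of β are hypothetical and are not taken from the study; only the per-module term e^{β(ln s − ln C)} and the ratio of the two sums come from the text above.

```python
import math


def summed_rdp(sizes, beta, reference_size):
    """Sum of e^{beta (ln s - ln C)} over the module sizes chosen by one strategy."""
    return sum(
        math.exp(beta * (math.log(s) - math.log(reference_size))) for s in sizes
    )


def relative_defect_proneness(first_sizes, second_sizes, beta, reference_size=1.0):
    """RDP of the first strategy's modules with respect to the second's."""
    return summed_rdp(first_sizes, beta, reference_size) / summed_rdp(
        second_sizes, beta, reference_size
    )


# Hypothetical inputs: both strategies inspect 400 LOC in total.
smallest_first = [100, 100, 100, 100]  # four small modules
largest_first = [400]                  # one large module
beta = 0.5  # assumed value, chosen only for illustration

rdp = relative_defect_proneness(smallest_first, largest_first, beta)
print(f"RDP (smallest-first vs. largest-first): {rdp:.3f}")
```

Because e^{β(ln s − ln C)} equals (s/C)^β, the factor C^{−β} appears in both sums and cancels, so the result does not depend on the reference size. With the assumed β = 0.5, the four 100-LOC modules together come out about twice as defect-prone as the single 400-LOC module, which is the kind of comparison behind the smallest-first inspection strategy.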

About this article

Cite this article

Koru, A.G., Emam, K.E., Zhang, D. et al. Theory of relative defect proneness. Empir Software Eng 13, 473–498 (2008). https://doi.org/10.1007/s10664-008-9080-x

Published: 05 September 2008
Issue Date: October 2008
DOI: https://doi.org/10.1007/s10664-008-9080-x
