
Theory of relative defect proneness

Replicated studies on the functional form of the size-defect relationship

Empirical Software Engineering

Abstract

In this study, we investigated the functional form of the size-defect relationship for software modules through replicated studies conducted on ten open-source products. We consistently observed a power-law relationship where defect proneness increases at a slower rate compared to size. Therefore, smaller modules are proportionally more defect prone. We externally validated the application of our results for two commercial systems. Given limited and fixed resources for code inspections, there would be an impressive improvement in the cost-effectiveness, as much as 341% in one of the systems, if a smallest-first strategy were preferred over a largest-first one. The consistent results obtained in this study led us to state a theory of relative defect proneness (RDP): In large-scale software systems, smaller modules will be proportionally more defect-prone compared to larger ones. We suggest that practitioners consider our results and give higher priority to smaller modules in their focused quality assurance efforts.


Notes

  1. Webcite link: http://www.webcitation.org/5RqqbCKKm (cached Sep. 14, 2007)

  2. Webcite link: http://www.webcitation.org/5Rqr0BSz8 (cached Sep. 14, 2007)

  3. CVS was the source code control system used by the KOffice developers. Webcite link: http://www.webcitation.org/5RrT2BaV1 (cached Sep. 14, 2007)

  4. Perl is a stable, cross-platform programming language. Webcite link: http://www.webcitation.org/5RrTDEdYV (cached Sep. 14, 2007)


Acknowledgements

We would like to thank Frank E. Harrell for extending and modifying some of the functionality in his Design package for us, Victor R. Basili for his helpful comments, Jeff Tian for providing data, the associate editor, Tim Menzies, for his guidance and suggestions, and the anonymous reviewers of this paper for their helpful and constructive feedback.

Author information

Corresponding author

Correspondence to A. Güneş Koru.

Appendix


In this appendix, we explain how to calculate the RDP of the modules chosen by one inspection strategy with respect to those chosen by another inspection strategy. The first inspection strategy chooses $m$ modules having sizes (in LOC) $s_1, s_2, \ldots, s_m$, and the second one chooses $n$ modules having sizes $S_1, S_2, \ldots, S_n$.

First, let us take a reference module with size $C$. Since we observed a logarithmic shape for the link function, following (2) and omitting the time parameter $t$ to simplify the notation, the RDP of an individual module of size $s$ with respect to this reference module at any time $t$ would be $e^{\beta(\ln s - \ln C)}$. For each inspection strategy, we calculate the sum of the RDP of the selected individual modules with respect to the reference module. To find the RDP, we simply take the ratio of these sums:

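$$\mathrm{RDP} = \frac{\sum_{i=1}^{m} e^{\beta(\ln s_i - \ln C)}}{\sum_{j=1}^{n} e^{\beta(\ln S_j - \ln C)}} \tag{8}$$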
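The ratio of sums described in this appendix can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `relative_defect_proneness`, its parameters, and the value of `beta` used in the example are all assumptions made for illustration. Because $e^{\beta(\ln s - \ln C)} = (s/C)^{\beta}$, the factor $C^{\beta}$ appears in both sums and cancels, so the result should not depend on the choice of the reference size.

```python
import math
from typing import Sequence


def relative_defect_proneness(
    first: Sequence[float],
    second: Sequence[float],
    beta: float,
    reference_size: float = 1.0,
) -> float:
    """RDP of the modules in `first` with respect to those in `second`.

    `first` and `second` hold module sizes in LOC, `beta` is the size
    coefficient, and `reference_size` plays the role of C in the appendix.
    All names here are hypothetical and chosen for illustration only.
    """
    if not first or not second:
        raise ValueError("both strategies must select at least one module")
    if reference_size <= 0 or any(s <= 0 for s in [*first, *second]):
        raise ValueError("sizes must be positive so that ln(size) is defined")

    def summed_rdp(sizes: Sequence[float]) -> float:
        # Sum of exp(beta * (ln s - ln C)) over the selected modules.
        log_c = math.log(reference_size)
        return sum(math.exp(beta * (math.log(s) - log_c)) for s in sizes)

    return summed_rdp(first) / summed_rdp(second)


# Illustrative comparison under the same 100-LOC inspection budget.
# beta = 0.5 is an assumed value, not a coefficient reported in the paper.
smallest_first = [25, 25, 25, 25]  # four small modules
largest_first = [100]              # one large module
ratio = relative_defect_proneness(smallest_first, largest_first, beta=0.5)
print(ratio)  # approximately 2.0
```

With any $\beta$ between 0 and 1, spending the same number of inspected lines on several small modules yields a ratio above 1 in this sketch, which is the effect the smallest-first strategy relies on.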

About this article

Cite this article

Koru, A.G., Emam, K.E., Zhang, D. et al. Theory of relative defect proneness. Empir Software Eng 13, 473–498 (2008). https://doi.org/10.1007/s10664-008-9080-x

  • DOI: https://doi.org/10.1007/s10664-008-9080-x
