DOI: 10.1145/1137983.1138013
Article

Information theoretic evaluation of change prediction models for large-scale software

Published: 22 May 2006

ABSTRACT

In this paper, we analyze the data extracted from several open source software repositories. We observe that the change data follows a Zipf distribution. Based on the extracted data, we then develop three probabilistic models to predict which files will have changes or bugs. The first model is Maximum Likelihood Estimation (MLE), which simply counts the number of events, i.e., changes or bugs, that happen to each file and normalizes the counts to compute a probability distribution. The second model is Reflexive Exponential Decay (RED) in which we postulate that the predictive rate of modification in a file is incremented by any modification to that file and decays exponentially. The third model is called RED-Co-Change. With each modification to a given file, the RED-Co-Change model not only increments its predictive rate, but also increments the rate for other files that are related to the given file through previous co-changes. We then present an information-theoretic approach to evaluate the performance of different prediction models. In this approach, the closeness of model distribution to the actual unknown probability distribution of the system is measured using cross entropy. We evaluate our prediction models empirically using the proposed information-theoretic approach for six large open source systems. Based on this evaluation, we observe that of our three prediction models, the RED-Co-Change model predicts the distribution that is closest to the actual distribution for all the studied systems.
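The three models and the cross-entropy evaluation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract gives no formulas, so the decay constant `decay`, the co-change increment `weight`, the additive smoothing constant `SMOOTH`, and all class and function names are assumptions made here. Cross entropy is estimated in the usual empirical way, as the average of -log2 q(file) over observed changes, where q is the model's distribution just before each change. Lower values mean the model distribution is closer to the actual one.

```python
import math
from collections import defaultdict

SMOOTH = 1e-3  # assumed additive smoothing so unseen files keep nonzero probability


def normalize(scores, files):
    """Turn non-negative scores into a smoothed probability distribution over files."""
    total = sum(scores.get(f, 0.0) + SMOOTH for f in files)
    return {f: (scores.get(f, 0.0) + SMOOTH) / total for f in files}


class MLE:
    """Counts events per file; the distribution is the normalized counts."""

    def __init__(self):
        self.counts = defaultdict(float)

    def predict(self, files, now):
        return normalize(self.counts, files)

    def update(self, changed, now):
        for f in changed:
            self.counts[f] += 1.0


class RED:
    """Each change adds 1 to a file's rate; all rates decay exponentially over time."""

    def __init__(self, decay=0.1):  # decay constant is an assumed value
        self.decay = decay
        self.rates = defaultdict(float)
        self.last = 0.0

    def _age(self, now):
        # Decay every rate by the time elapsed since the last call.
        factor = math.exp(-self.decay * (now - self.last))
        for f in self.rates:
            self.rates[f] *= factor
        self.last = now

    def predict(self, files, now):
        self._age(now)
        return normalize(self.rates, files)

    def update(self, changed, now):
        self._age(now)
        for f in changed:
            self.rates[f] += 1.0


class REDCoChange(RED):
    """RED, plus a smaller increment for files that previously co-changed."""

    def __init__(self, decay=0.1, weight=0.5):  # weight is an assumed value
        super().__init__(decay)
        self.weight = weight
        self.partners = defaultdict(set)

    def update(self, changed, now):
        super().update(changed, now)
        related = set()
        for f in changed:
            related |= self.partners[f]
        for g in related - set(changed):
            self.rates[g] += self.weight
        # Record the co-changes seen in this commit for future updates.
        for f in changed:
            self.partners[f] |= set(changed) - {f}


def cross_entropy(model, commits, files):
    """Average -log2 q(file) over observed changes, predicting before each update."""
    bits, n = 0.0, 0
    for now, changed in commits:
        q = model.predict(files, now)
        for f in changed:
            bits -= math.log2(q[f])
            n += 1
        model.update(changed, now)
    return bits / n


# Hypothetical toy history: (timestamp, files changed together).
files = ["a.c", "b.c", "c.c"]
commits = [(1, ["a.c", "b.c"]), (2, ["a.c"]), (3, ["b.c"]), (4, ["a.c", "b.c"]), (5, ["a.c"])]
for model in (MLE(), RED(), REDCoChange()):
    print(type(model).__name__, round(cross_entropy(model, commits, files), 3))
```

Each model is scored online: it predicts a distribution over files before seeing a commit and is updated afterwards, so the cross entropy reflects predictive quality rather than fit to data already seen. The toy history is far too small to say anything about which model is better. The paper's comparison rests on six large open source systems.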

Published in

MSR '06: Proceedings of the 2006 international workshop on Mining software repositories
May 2006
191 pages
ISBN: 1595933972
DOI: 10.1145/1137983

Copyright © 2006 ACM

Publisher

Association for Computing Machinery, New York, NY, United States
