ABSTRACT
In this paper, we analyze data extracted from several open source software repositories and observe that the change data follows a Zipf distribution. Based on the extracted data, we develop three probabilistic models to predict which files will have changes or bugs. The first model, Maximum Likelihood Estimation (MLE), simply counts the events (changes or bugs) that happen to each file and normalizes the counts to obtain a probability distribution. The second model is Reflexive Exponential Decay (RED), in which we postulate that the predictive rate of modification of a file is incremented by any modification to that file and then decays exponentially. The third model is called RED-Co-Change. With each modification to a given file, RED-Co-Change increments not only that file's predictive rate but also the rates of other files that are related to it through previous co-changes. We then present an information-theoretic approach to evaluating the performance of different prediction models, in which the closeness of a model's distribution to the actual, unknown probability distribution of the system is measured using cross entropy. We evaluate our prediction models empirically with this approach on six large open source systems. Of our three prediction models, RED-Co-Change predicts the distribution that is closest to the actual distribution for all the studied systems.
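The MLE and RED models and the cross-entropy evaluation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `half_life` parameterization of the decay, the `smoothing` constant, the normalization of RED rates into a distribution, and the toy history are all assumptions made for illustration. RED-Co-Change is left out because the abstract does not say how much rate a co-changed file receives.

```python
import math

def mle_distribution(events, files, smoothing=1e-6):
    """MLE model: count events per file and normalize the counts.

    events is a list of (time, file) pairs. The small smoothing term is an
    assumption that keeps every probability non-zero so logs stay finite.
    """
    counts = {f: smoothing for f in files}
    for _, f in events:
        counts[f] += 1.0
    total = sum(counts.values())
    return {f: c / total for f, c in counts.items()}

def red_distribution(events, files, now, half_life=30.0, smoothing=1e-6):
    """RED-style model: each past event adds 1 to its file's rate, and that
    contribution decays exponentially with the time elapsed since the event.

    half_life (in the same units as the event times) is a hypothetical
    parameter; normalizing the rates into a distribution is also assumed.
    """
    decay = math.log(2) / half_life
    rates = {f: smoothing for f in files}
    for t, f in events:
        if t <= now:
            rates[f] += math.exp(-decay * (now - t))
    total = sum(rates.values())
    return {f: r / total for f, r in rates.items()}

def cross_entropy(model, observed_files):
    """Average number of bits the model needs per observed event.

    Lower values mean the model's distribution is closer to the one that
    generated the observed events.
    """
    bits = sum(math.log2(model[f]) for f in observed_files)
    return -bits / len(observed_files)

# Toy history (hypothetical): a.c and b.c changed long ago, c.c recently.
files = ["a.c", "b.c", "c.c"]
history = [(0, "a.c"), (1, "a.c"), (2, "b.c"),
           (50, "c.c"), (58, "c.c"), (59, "c.c")]
future = ["c.c", "c.c", "c.c"]  # events observed after time 60

mle = mle_distribution(history, files)
red = red_distribution(history, files, now=60)

print("MLE cross entropy:", cross_entropy(mle, future))
print("RED cross entropy:", cross_entropy(red, future))
```

On this toy history the recency-weighted model assigns more probability to the recently active file, so it scores a lower cross entropy on the future events. That outcome is a property of the constructed data and is not evidence about the systems studied in the paper.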
Index Terms
- Information theoretic evaluation of change prediction models for large-scale software