DOI: 10.1145/2884781.2884839
Research article

Cross-project defect prediction using a connectivity-based unsupervised classifier

Published: 14 May 2016

ABSTRACT

Defect prediction on projects with limited historical data has attracted great interest from both researchers and practitioners. Cross-project defect prediction, which reuses classifiers trained on other projects, has been the main avenue of progress. However, existing approaches require some degree of homogeneity (e.g., a similar distribution of metric values) between the training projects and the target project. Satisfying this homogeneity requirement often requires significant effort and remains a very active area of research.

An unsupervised classifier does not require any training data; therefore, the heterogeneity challenge is no longer an issue. In this paper, we examine two types of unsupervised classifiers: a) distance-based classifiers (e.g., k-means); and b) connectivity-based classifiers. While distance-based unsupervised classifiers have previously been used in the defect prediction literature with disappointing performance, connectivity-based classifiers have never before been explored in our community.

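To make the idea concrete, the following is a minimal, hypothetical sketch (in Python, using scikit-learn) of a connectivity-based unsupervised defect classifier in the spirit of the approach studied here: modules are clustered with spectral clustering on their standardized static code metrics, and the cluster whose members have larger average metric values is labeled defect-prone. The labeling heuristic and the RBF affinity are illustrative assumptions, not the paper's exact implementation.

    # A minimal sketch of a connectivity-based unsupervised defect classifier.
    # Assumption: the cluster with larger average (standardized) metric values
    # is treated as defect-prone; this is an illustrative heuristic, not
    # necessarily the authors' exact rule.
    from sklearn.cluster import SpectralClustering
    from sklearn.preprocessing import StandardScaler

    def spectral_defect_classifier(metrics, random_state=0):
        """metrics: (n_modules, n_metrics) array of static code metrics.
        Returns a 0/1 array; 1 marks modules predicted as defect-prone."""
        X = StandardScaler().fit_transform(metrics)  # per-metric z-scores
        clusters = SpectralClustering(n_clusters=2,
                                      affinity="rbf",
                                      assign_labels="kmeans",
                                      random_state=random_state).fit_predict(X)
        # Pick the cluster whose rows have the larger summed metric values.
        row_score = X.sum(axis=1)
        defect_cluster = int(row_score[clusters == 1].mean()
                             > row_score[clusters == 0].mean())
        return (clusters == defect_cluster).astype(int)
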
We compare the performance of unsupervised classifiers versus supervised classifiers using data from 26 projects from three publicly available datasets (i.e., AEEEM, NASA, and PROMISE). In the cross-project setting, our proposed connectivity-based classifier (via spectral clustering) ranks as one of the top classifiers among five widely used supervised classifiers (i.e., random forest, naive Bayes, logistic regression, decision tree, and logistic model tree) and five unsupervised classifiers (i.e., k-means, partition around medoids, fuzzy C-means, neural-gas, and spectral clustering). In the within-project setting (i.e., models are built and applied on the same project), our spectral classifier ranks in the second tier, while only random forest ranks in the first tier. Hence, connectivity-based unsupervised classifiers offer a viable solution for both cross-project and within-project defect prediction.

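Because the classifier needs no training data, a cross-project evaluation reduces to applying it to each target project and scoring the predictions against the known labels. The loop below is a hypothetical sketch of such a per-project evaluation using the classifier sketched above; the `projects` structure, the data loading, and the use of AUC as the score are assumptions (the paper reports its own performance measures and statistical ranking of classifiers).

    # Hypothetical evaluation loop over labeled benchmark projects
    # (e.g., drawn from AEEEM, NASA, or PROMISE). Uses the sketch above.
    from sklearn.metrics import roc_auc_score

    def evaluate_projects(projects):
        """projects: dict mapping project name -> (metrics, labels)."""
        scores = {}
        for name, (metrics, labels) in projects.items():
            preds = spectral_defect_classifier(metrics)  # no training data needed
            scores[name] = roc_auc_score(labels, preds)
        return scores
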

Published in

ICSE '16: Proceedings of the 38th International Conference on Software Engineering
May 2016, 1235 pages
ISBN: 9781450339001
DOI: 10.1145/2884781
Copyright © 2016 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
