Reflections on the NASA MDP data sets

D. Gray; D. Bowes; N. Davey; Y. Sun; B. Christianson

Author(s): D. Gray¹; D. Bowes¹; N. Davey¹; Y. Sun¹; B. Christianson¹
Affiliations: 1: Computer Science Department, University of Hertfordshire, UK
Source: IET Software, Volume 6, Issue 6, December 2012, p. 549–558
DOI: 10.1049/iet-sen.2011.0132, Print ISSN 1751-8806, Online ISSN 1751-8814


Background: The NASA metrics data program (MDP) data sets have been heavily used in software defect prediction research. Aim: To highlight the data quality issues present in these data sets, and the problems that can arise when they are used in a binary classification context. Method: A thorough exploration of all 13 original NASA data sets, followed by various experiments demonstrating the potential impact of duplicate data points when data mining. Conclusions: Firstly, researchers need to analyse the data that forms the basis of their findings in the context of how it will be used. Secondly, the bulk of defect prediction experiments based on the NASA MDP data sets may have led to erroneous findings, mainly because repeated (duplicate) data points can cause substantial amounts of training and testing data to be identical.
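The duplicate-data concern in the abstract can be illustrated with a small sketch. This is not the authors' experimental procedure, which the abstract does not describe; it is a minimal illustration, assuming each module is a row of metric values plus a defect label, of how repeated rows can end up on both sides of a random train/test split. The toy rows and the helper names `train_test_overlap` and `drop_duplicates` are hypothetical.

```python
import random


def train_test_overlap(rows, test_fraction=0.3, seed=0):
    """Randomly split rows and count test rows that also occur in training."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    train_set = set(train)
    seen = sum(1 for row in test if row in train_set)
    return seen, len(test)


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    return list(dict.fromkeys(rows))


# Hypothetical toy data: (lines_of_code, cyclomatic_complexity, defective).
# These values are invented for illustration, not taken from the NASA MDP sets.
rows = [(10, 2, 0)] * 6 + [(45, 9, 1)] * 3 + [(12, 3, 0), (80, 15, 1), (7, 1, 0)]

seen, n_test = train_test_overlap(rows)
print(f"{seen} of {n_test} test rows also occur in the training data")

unique_rows = drop_duplicates(rows)
seen_u, n_test_u = train_test_overlap(unique_rows)
print(f"after de-duplication: {seen_u} of {n_test_u}")
```

When a test row is identical to a training row, a classifier can score well on it by memorisation alone, so the reported performance may overstate how the model behaves on unseen modules. Whether duplicates should be removed for a given study depends on how the data will be used, which is the point the abstract's first conclusion makes.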

