Abstract
The quality of defect datasets is a critical issue in software defect prediction (SDP). These datasets are obtained by mining software repositories. Recent studies have raised concerns about their quality, attributing the problem to inconsistencies between the bug/clean fix keywords in fault reports and the corresponding links in the change management logs. The class imbalance (CI) problem is another major challenge for SDP models. A defect prediction method trained on noisy and imbalanced data produces inconsistent and unsatisfactory results, so a combined analysis of noisy instances and the CI problem is required. To the best of our knowledge, few studies have addressed both aspects together. In this paper, we examine the impact of noise and the CI problem on five baseline SDP models: we manually added noise at various levels (0–80%) and measured its effect on the performance of those models. We also provide guidelines on the range of noise that the baseline models can tolerate, and we suggest the SDP model that has the highest noise tolerance and outperforms the other classical methods. The True Positive Rate (TPR) and False Positive Rate (FPR) values of the baseline models reduce by 20–30% after adding 10–40% noisy instances. Similarly, the ROC (Receiver Operating Characteristic) values of the SDP models reduce to 40–50%. Compared with the other traditional models, the suggested model withstands noise levels of 40–60%.
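The noise-injection setup described in the abstract can be illustrated with a minimal sketch. It assumes that "noise" means flipping the binary defect label of a randomly chosen fraction of instances, which the abstract does not state explicitly; the function names `inject_label_noise` and `tpr_fpr` and the toy label vector are hypothetical and are not taken from the paper.

```python
import random


def inject_label_noise(labels, noise_level, seed=0):
    """Return a copy of binary labels (0 = clean, 1 = defective) in which a
    fraction `noise_level` of randomly chosen instances has its label flipped.
    Assumption: label flipping stands in for the paper's manual noise injection."""
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be in [0, 1]")
    rng = random.Random(seed)
    n_flip = round(noise_level * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


def tpr_fpr(y_true, y_pred):
    """Compute (TPR, FPR) for binary labels, treating 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


if __name__ == "__main__":
    # Toy imbalanced dataset: 2 defective modules out of 10 (hypothetical).
    clean = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    for level in (0.0, 0.2, 0.4, 0.8):
        noisy = inject_label_noise(clean, level, seed=1)
        flipped = sum(a != b for a, b in zip(clean, noisy))
        print(f"noise={level:.0%} flipped={flipped} noisy_labels={noisy}")
```

In a full experiment along these lines, each baseline classifier would be trained on the noisy labels and scored against the clean ones with `tpr_fpr`, once per noise level, so the degradation can be compared across models.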
Acknowledgements
The authors would like to thank IIT-BHU for providing such a vital research platform for students, researchers, and faculty. We also thank the faculty members and students of the CSE department at IIT-BHU for their valuable comments and motivation.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by Sushant Kumar Pandey.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Pandey, S.K., Tripathi, A.K. An empirical study toward dealing with noise and class imbalance issues in software defect prediction. Soft Comput 25, 13465–13492 (2021). https://doi.org/10.1007/s00500-021-06096-3
DOI: https://doi.org/10.1007/s00500-021-06096-3