
An empirical study toward dealing with noise and class imbalance issues in software defect prediction

  • Data analytics and machine learning

Published in: Soft Computing

Abstract

The quality of defect datasets is a critical issue in software defect prediction (SDP). These datasets are obtained by mining software repositories. Recent studies raise concerns about their quality because of inconsistencies between bug/clean fix keywords in fault reports and the corresponding links in change management logs. The class imbalance (CI) problem is another major challenge for SDP models. A defect prediction method trained on noisy and imbalanced data yields inconsistent and unsatisfactory results, so a combined analysis of noisy instances and the CI problem is required. To the best of our knowledge, few studies have addressed both aspects together. In this paper, we examine the impact of noise and the CI problem on five baseline SDP models: we manually add various noise levels (0–80%) and measure the effect on the performance of those models. We further provide guidelines on the range of noise each baseline model can tolerate, and we suggest an SDP model that has the highest noise tolerance and outperforms the other classical methods. The true positive rate (TPR) and false positive rate (FPR) values of the baseline models degrade by 20–30% after adding 10–40% noisy instances; similarly, the ROC (receiver operating characteristic) values of the SDP models drop by 40–50%. The suggested model tolerates 40–60% noise, compared with the other traditional models.
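The noise-injection experiment described above can be illustrated with a minimal sketch: flip a fraction of training labels at random on synthetic, imbalanced data, then record TPR, FPR, and ROC-AUC at each noise level. This is not the paper's actual pipeline, datasets, or baseline models; the `inject_label_noise` helper, the random forest classifier, and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

def inject_label_noise(y, noise_level, rng):
    """Flip a given fraction of binary labels, chosen uniformly at random."""
    y_noisy = y.copy()
    n_flip = int(noise_level * len(y))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y_noisy[idx] = 1 - y_noisy[idx]
    return y_noisy

rng = np.random.default_rng(42)
# Imbalanced two-class data: ~80% clean modules, ~20% defective.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

for noise in (0.0, 0.2, 0.4):
    y_tr_noisy = inject_label_noise(y_tr, noise, rng)
    clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr_noisy)
    y_pred = clf.predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"noise={noise:.0%}  TPR={tpr:.3f}  FPR={fpr:.3f}  AUC={auc:.3f}")
```

Only the training labels are corrupted; the test set stays clean so the measured degradation reflects the model's noise tolerance rather than a corrupted ground truth.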



Acknowledgements

The authors would like to thank IIT-BHU for providing such a vital research platform for student researchers and faculty. We also thank the faculty members and students of the CSE department at IIT-BHU for their valuable comments and motivation.

Author information


Corresponding author

Correspondence to Sushant Kumar Pandey.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by Sushant Kumar Pandey.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Pandey, S.K., Tripathi, A.K. An empirical study toward dealing with noise and class imbalance issues in software defect prediction. Soft Comput 25, 13465–13492 (2021). https://doi.org/10.1007/s00500-021-06096-3

