A cross-validation framework to find a better state than the balanced one for oversampling in imbalanced classification

  • Original Article
  • Published:
International Journal of Machine Learning and Cybernetics

Abstract

Imbalanced classification has long been a popular research topic in machine learning, data mining, and pattern recognition. Many techniques exist to reduce the negative impact of class imbalance on classification performance, and oversampling is the most commonly used. In this paper, we examine the relationship between the imbalance ratio and classification performance during oversampling from a novel perspective: oversampling may distort the original data distribution even as it enhances the minority class. In addition, this paper proposes a novel cross-validation framework, called "icross-validation", that can be used during sampling to find a better state than the balanced one. The framework is general and can be applied to various oversampling methods. Experimental results on several real data sets demonstrate the effectiveness of icross-validation in comparison with state-of-the-art and widely used oversampling methods. All code has been released in the open-source icross-validation library at https://github.com/syxiaa/icross-valiation.
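The abstract's core idea, treating the post-oversampling imbalance ratio as a quantity to tune by cross-validation rather than fixing it at the balanced 1:1 state, can be sketched as follows. This is not the paper's icross-validation algorithm (see the linked repository for that); it is a minimal illustration under stated assumptions, using plain random duplication as the oversampler, a decision tree as the base classifier, and F1 as the selection metric. The helper names `random_oversample` and `best_ratio_by_cv` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def random_oversample(X, y, ratio, rng):
    """Duplicate minority samples until n_minority / n_majority is about `ratio`."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    idx_min = np.where(y == minority)[0]
    n_target = int(ratio * counts.max())
    n_extra = n_target - len(idx_min)
    if n_extra <= 0:
        return X, y
    extra = rng.choice(idx_min, size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

def best_ratio_by_cv(X, y, ratios, n_splits=5, seed=0):
    """Score each candidate imbalance ratio with stratified CV; return the best."""
    rng = np.random.default_rng(seed)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {}
    for r in ratios:
        fold_scores = []
        for tr, te in skf.split(X, y):
            # Oversample only the training fold; the validation fold
            # keeps the original distribution, so the score reflects it.
            Xo, yo = random_oversample(X[tr], y[tr], r, rng)
            clf = DecisionTreeClassifier(random_state=seed).fit(Xo, yo)
            fold_scores.append(f1_score(y[te], clf.predict(X[te])))
        scores[r] = float(np.mean(fold_scores))
    return max(scores, key=scores.get), scores

# A 9:1 imbalanced toy problem; the selected ratio need not be 1.0 (balanced).
X, y = make_classification(n_samples=600, n_features=8, weights=[0.9, 0.1],
                           random_state=42)
best, scores = best_ratio_by_cv(X, y, ratios=[0.25, 0.5, 0.75, 1.0])
print("best ratio:", best)
```

The design point this sketch shares with the paper is that the candidate ratios include under-balanced states, so cross-validation is free to conclude that stopping short of full balance generalizes better.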

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62176033 and 61936001, the Key Cooperation Project of Chongqing Municipal Education Commission under Grant No. HZ2021008, the Natural Science Foundation of Chongqing under Grant No. cstc2019jcyj-cxttX0002, and the National Key Research and Development Program of China under Grant No. 2019QY(Y)0301.

Author information

Corresponding author

Correspondence to Shuyin Xia.

Ethics declarations

Conflict of interest

All authors have no conflict of interest to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (PDF 78 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dai, Q., Li, D. & Xia, S. A cross-validation framework to find a better state than the balanced one for oversampling in imbalanced classification. Int. J. Mach. Learn. & Cyber. 14, 2877–2886 (2023). https://doi.org/10.1007/s13042-023-01804-x
