Framework for extreme imbalance classification: SWIM—sampling with the majority class

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

The class imbalance problem is a pervasive issue in many real-world domains. Oversampling methods, which inflate the rare class by generating synthetic data, are amongst the most popular techniques for resolving class imbalance. However, they concentrate on the characteristics of the minority class alone and use them to guide the oversampling process. By completely overlooking the majority class, they lose the global view of the classification problem and, while alleviating the class imbalance, may negatively impact learnability by generating borderline or overlapping instances. This becomes even more critical under extreme class imbalance, where the minority class is so strongly underrepresented that, on its own, it does not contain enough information to guide the oversampling process. We propose a framework for synthetic oversampling that, unlike existing resampling methods, is robust in cases of extreme imbalance. Its key feature is that it uses the density of the well-sampled majority class to guide the generation process. We demonstrate implementations of the framework using the Mahalanobis distance and a radial basis function. An evaluation over 25 benchmark datasets shows that the framework offers a distinct performance improvement over the existing state of the art in oversampling techniques.
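For intuition, here is a minimal, hypothetical Python sketch of the Mahalanobis-distance variant described above: it fits the mean and covariance of the majority class, whitens the feature space so that Mahalanobis distance becomes Euclidean distance, and places synthetic points on the same majority-density contour as existing minority seeds. The function name, the `spread` parameter, and the contour-rescaling step are our illustrative assumptions, not the authors' implementation; the official code is linked in Note 1 below.

```python
import numpy as np

def swim_md_sketch(X_maj, X_min, n_samples, spread=0.1, seed=None):
    """Hypothetical sketch of majority-guided oversampling in the
    spirit of SWIM with the Mahalanobis distance; see Note 1 for the
    authors' actual implementation."""
    rng = np.random.default_rng(seed)

    # Fit the majority-class density: sample mean and covariance.
    mu = X_maj.mean(axis=0)
    cov = np.cov(X_maj, rowvar=False)

    # Whiten with the majority covariance so that Mahalanobis distance
    # from mu becomes plain Euclidean distance (cf. Note 3).
    L = np.linalg.cholesky(cov)
    W_min = np.linalg.solve(L, (X_min - mu).T).T

    synthetic = []
    for _ in range(n_samples):
        s = W_min[rng.integers(len(W_min))]
        # Perturb a minority seed, then rescale the candidate back onto
        # the seed's majority-density contour (equal distance from mu).
        c = s + rng.normal(scale=spread, size=s.shape)
        c *= np.linalg.norm(s) / np.linalg.norm(c)
        synthetic.append(c)

    # Map the whitened points back to the original feature space.
    return (L @ np.asarray(synthetic).T).T + mu
```

In this reading, oversampling preserves how plausible each synthetic point is under the majority distribution, which is what lets the well-sampled majority class, rather than the scarce minority class, steer the generation.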


Notes

  1. The SWIM code is available here: https://github.com/cbellinger27/SWIM.

  2. In practice, we soften the equality constraint to "less than or equal to".

  3. This takes advantage of the whitening done in the preprocessing: instead of computing the Mahalanobis distance (MD) directly, we can use the Euclidean distance in the whitened space (see the numerical check after these notes).

  4. With our default setting, based on the mean distance between majority-class samples, \(\epsilon \) is set to 1.68.
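As a quick numerical check of Notes 3 and 4, the snippet below verifies that the Mahalanobis distance from the majority mean equals the Euclidean norm after whitening, and computes a default \(\epsilon \) as the mean pairwise distance between majority-class samples. The toy data and the pairwise reading of "mean distance between majority class samples" are our assumptions, not taken from the paper.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis, pdist

rng = np.random.default_rng(0)
# Assumed toy majority class; stand-in for a real dataset.
X_maj = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.6], [0.6, 1.0]], size=500)

mu = X_maj.mean(axis=0)
cov = np.cov(X_maj, rowvar=False)

# Note 3: whitening with the majority covariance turns the Mahalanobis
# distance (MD) from the mean into a plain Euclidean norm.
L = np.linalg.cholesky(cov)
x = X_maj[0]
x_white = np.linalg.solve(L, x - mu)          # whitened coordinates
md = mahalanobis(x, mu, np.linalg.inv(cov))   # direct MD computation
assert np.isclose(md, np.linalg.norm(x_white))

# Note 4 (our reading): epsilon defaults to the mean pairwise
# distance between majority-class samples.
epsilon = pdist(X_maj).mean()
print(f"default epsilon = {epsilon:.2f}")
```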


Author information

Correspondence to Colin Bellinger.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Tabulated results

See Tables 4, 5, 6, 7, 8 and 9; the G-mean metric they report is defined after the list of tables.

Table 4 G-means for minority training size 3
Table 5 G-means for minority training size 5
Table 6 G-means for minority training size 10
Table 7 G-means for minority training size 15
Table 8 G-means for minority training size 20
Table 9 G-means for minority training size 30
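For reference, the G-mean reported throughout these tables is, under its standard definition for binary classification, the geometric mean of the true positive and true negative rates:

\[
\text{G-mean} = \sqrt{\mathrm{TPR} \times \mathrm{TNR}} = \sqrt{\frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \cdot \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}}
\]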

About this article

Cite this article

Bellinger, C., Sharma, S., Japkowicz, N. et al. Framework for extreme imbalance classification: SWIM—sampling with the majority class. Knowl Inf Syst 62, 841–866 (2020). https://doi.org/10.1007/s10115-019-01380-z
