
Class imbalance revisited: a new experimental setup to assess the performance of treatment methods

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

In the last decade, class imbalance has attracted a great deal of attention from researchers and practitioners. Class imbalance is ubiquitous in Machine Learning, Data Mining and Pattern Recognition applications, and these research communities have responded with dozens of methods and techniques. Surprisingly, fundamental questions remain open, such as “Are all learning paradigms equally affected by class imbalance?”, “What is the expected performance loss for different imbalance degrees?” and “How much of the performance loss can be recovered by treatment methods?”. In this paper, we propose a simple experimental design to assess the performance of class imbalance treatment methods. This setup uses real data sets with artificially modified class distributions to evaluate classifiers over a wide range of class imbalance. We apply this experimental design in a large-scale evaluation with 22 data sets and seven learning algorithms from different paradigms. We also propose a statistical procedure, based on confidence intervals, to evaluate the relative performance degradation and recovery. This procedure allows a simple yet insightful visualization of the results and provides a basis for drawing statistical conclusions. Our results indicate that the expected performance loss, as a percentage of the performance obtained with the balanced distribution, is quite modest (below 5%) for distributions with 10% or more minority class examples. However, the loss tends to increase quickly for higher degrees of class imbalance, reaching 20% when only 1% of the examples belong to the minority class. Support Vector Machines are the classifier paradigm least affected by class imbalance, being almost insensitive to all but the most imbalanced distributions. Finally, we show that the treatment methods only partially recover the performance losses: on average, they typically recovered about 30% or less of the performance lost due to class imbalance.
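
To make this setup concrete, the sketch below subsamples a training set to a series of target class distributions and tracks AUC on a fixed test set. It is an illustration in Python with scikit-learn, not the authors' exact protocol: the synthetic data, the decision tree learner, the AUC metric and the subsample_to_distribution helper are all assumptions made for the example.

    # Minimal sketch (not the authors' exact protocol): subsample a real,
    # roughly balanced data set to a range of artificially imposed class
    # distributions and measure classifier performance at each level.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import roc_auc_score

    def subsample_to_distribution(X, y, pos_fraction, rng):
        """Subsample so that pos_fraction of the kept instances are positive."""
        pos_idx, neg_idx = np.where(y == 1)[0], np.where(y == 0)[0]
        # Keep all negatives; draw just enough positives for the target ratio.
        n_pos = int(pos_fraction / (1 - pos_fraction) * len(neg_idx))
        keep = np.concatenate([neg_idx, rng.choice(pos_idx, n_pos, replace=False)])
        return X[keep], y[keep]

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=4000, weights=[0.5, 0.5], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    for pos_fraction in [0.50, 0.25, 0.10, 0.05, 0.01]:  # X/Y class distributions
        Xs, ys = subsample_to_distribution(X_tr, y_tr, pos_fraction, rng)
        clf = DecisionTreeClassifier(random_state=0).fit(Xs, ys)
        auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
        print(f"{pos_fraction:.0%} minority: AUC = {auc:.3f}")

Plotting the measured AUC against the minority-class fraction yields the kind of degradation curve this experimental design is meant to expose.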

Notes

  1. We use the notation \(X/Y\), with \(X+Y=100\), to denote that, in a set of 100 instances, \(X\) belong to the positive class and \(Y\) to the negative class. For example, 10/90 denotes 10 positive and 90 negative instances.

  2. CRAN (http://cran.r-project.org) is a network of Web servers distributed around the world that stores versions of the code and documentation for the statistical software R, as well as community-contributed packages.
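
The statistical procedure summarized in the abstract attaches confidence intervals to relative degradation and recovery. As a rough companion sketch, the following Python snippet computes a percentile-bootstrap interval for the mean paired ratio of performance under imbalance to performance on the balanced distribution; the bootstrap choice and all numbers are hypothetical stand-ins, not the paper's actual procedure or results.

    # Hedged sketch: percentile-bootstrap CI for the mean paired ratio of
    # performance under imbalance to performance on the balanced distribution.
    # This is a stand-in for the paper's CI-based procedure, not a copy of it.
    import numpy as np

    def ratio_ci(balanced, imbalanced, n_boot=10_000, alpha=0.05, seed=0):
        """Percentile-bootstrap CI (default 95%) for mean(imbalanced / balanced)."""
        rng = np.random.default_rng(seed)
        ratios = np.asarray(imbalanced) / np.asarray(balanced)
        boot = [rng.choice(ratios, size=len(ratios), replace=True).mean()
                for _ in range(n_boot)]
        lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return ratios.mean(), (lo, hi)

    # Hypothetical AUCs for five data sets at 50/50 and at 1/99.
    balanced   = [0.95, 0.90, 0.88, 0.92, 0.85]
    imbalanced = [0.78, 0.73, 0.70, 0.76, 0.66]
    mean_ratio, (lo, hi) = ratio_ci(balanced, imbalanced)
    print(f"performance retained: {mean_ratio:.1%} (95% CI {lo:.1%}-{hi:.1%})")

An interval for the fraction of lost performance recovered by a treatment method can be built the same way, pairing treated and untreated results per data set.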

Acknowledgments

We thank the anonymous reviewers for their comments on the draft of this paper. We also thank Nitesh Chawla for providing the Microcalcifications in Mammography data set. This work was funded by FAPESP award 2012/07295-3.

Author information

Correspondence to Ronaldo C. Prati.

Cite this article

Prati, R.C., Batista, G.E.A.P.A. & Silva, D.F. Class imbalance revisited: a new experimental setup to assess the performance of treatment methods. Knowl Inf Syst 45, 247–270 (2015). https://doi.org/10.1007/s10115-014-0794-3
