NEATER: filtering of over-sampled data using non-cooperative game theory

Almogahed, B. A.; Kakadiaris, I. A.

doi:10.1007/s00500-014-1484-5

Methodologies and Application
Published: 19 October 2014

Volume 19, pages 3301–3322, (2015)
Soft Computing

B. A. Almogahed¹ & I. A. Kakadiaris¹

Abstract

In this paper, we present a method for the filteriNg of ovEr-sampled dAta using non-cooperaTive gamE theoRy (NEATER) to address the imbalanced data problem. Specifically, the problem is formulated as a non-cooperative game in which all the data are players and the goal is to label, uniformly and consistently, all of the synthetic data created by any over-sampling technique. The proposed algorithm requires no prior assumptions and selects representative synthetic instances while generating very few noisy instances. We present extensive experimental results over a large collection of datasets using three different classifiers to demonstrate the advantages of our method.
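The abstract does not give the payoff definition or the equilibrium solver, so the following is only a generic illustration of the idea of treating data points as players in a labeling game, not the authors' algorithm. It assumes Gaussian-similarity payoffs and discrete replicator dynamics; the function names (similarity, filter_synthetic) and all parameters (sigma, iters, minority) are hypothetical.

```python
import math


def similarity(a, b, sigma=1.0):
    """Gaussian similarity between two points (an assumed payoff weight)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def filter_synthetic(real_points, real_labels, synthetic_points,
                     n_classes=2, minority=1, sigma=1.0, iters=50):
    """Keep the synthetic points whose final strategy favors `minority`.

    Assumption: real points play a fixed pure strategy (their known label),
    while synthetic points start from a uniform mixed strategy and are
    updated with discrete replicator dynamics.
    """
    points = list(real_points) + list(synthetic_points)
    n_real, n = len(real_points), len(points)

    strategies = []
    for i in range(n):
        if i < n_real:
            strategies.append(
                [1.0 if c == real_labels[i] else 0.0 for c in range(n_classes)])
        else:
            strategies.append([1.0 / n_classes] * n_classes)

    # Pairwise payoff weights; a player gets no payoff from itself.
    w = [[0.0 if i == j else similarity(points[i], points[j], sigma)
          for j in range(n)] for i in range(n)]

    for _ in range(iters):
        updated = [s[:] for s in strategies]
        for i in range(n_real, n):
            # Payoff of class c: similarity-weighted support from other players.
            payoff = [sum(w[i][j] * strategies[j][c] for j in range(n))
                      for c in range(n_classes)]
            avg = sum(p * u for p, u in zip(strategies[i], payoff))
            if avg > 0.0:
                # Replicator step; the result still sums to one.
                updated[i] = [p * u / avg
                              for p, u in zip(strategies[i], payoff)]
        strategies = updated

    kept = []
    for i in range(n_real, n):
        s = strategies[i]
        if max(range(n_classes), key=lambda c: s[c]) == minority:
            kept.append(points[i])
    return kept


# Toy data: majority class 0 near the origin, minority class 1 near (5, 5).
real = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 0, 0, 1, 1]
synthetic = [(5.5, 5.5), (0.5, 0.5)]
kept = filter_synthetic(real, labels, synthetic)
```

In this toy run the synthetic point inside the majority region is filtered out as noisy, and the one next to the real minority points is kept. A practical version would need the over-sampler, payoff function, and stopping rule described in the full article.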

Acknowledgments

This research was funded in part by the US Department of Education (P200A070377 and P200A100119) with cost sharing provided by the University of Houston (UH) and in part by UH Hugh Roy and Lillie Cranz Cullen Endowment Fund.

Author information

Authors and Affiliations

Computational Biomedicine Lab, Department of Computer Science, University of Houston, Houston, USA
B. A. Almogahed & I. A. Kakadiaris

Corresponding author

Correspondence to I. A. Kakadiaris.

Additional information

Communicated by V. Loia.

Appendix

This appendix provides seven tables with the detailed results of the experimental analysis carried out in the present work. Table 12 reports the AUC values achieved by all algorithms on all datasets with the C4.5 classifier, Table 13 reports the corresponding results for the random forest classifier, and Table 14 those for the SVM classifier. Tables 15, 16 and 17 report the AUC values on the high-dimensional datasets for the same three classifiers. The best results are highlighted in boldface. Table 18 contains the full description of the datasets used for the experimental analysis.

Table 12 AUC results for C4.5 classifier
Table 13 AUC results for random forest classifier
Table 14 AUC results for SVM classifier
Table 15 AUC results on high-dimensional data for C4.5 classifier
Table 16 AUC results on high-dimensional data for random forest classifier
Table 17 AUC results on high-dimensional data for SVM classifier
Table 18 Description of datasets used for the experimental analysis

About this article

Cite this article

Almogahed, B.A., Kakadiaris, I.A. NEATER: filtering of over-sampled data using non-cooperative game theory. Soft Comput 19, 3301–3322 (2015). https://doi.org/10.1007/s00500-014-1484-5

Published: 19 October 2014
Issue Date: November 2015
DOI: https://doi.org/10.1007/s00500-014-1484-5
