NEATER: filtering of over-sampled data using non-cooperative game theory

  • Methodologies and Application
  • Soft Computing

Abstract

In this paper, we present a method for the filteriNg of ovEr-sampled dAta using non-cooperaTive gamE theoRy (NEATER) to address the imbalanced data problem. Specifically, the problem is formulated as a non-cooperative game where all the data are players and the goal is to uniformly and consistently label all of the synthetic data created by any over-sampling technique. The proposed algorithm does not require any prior assumptions and selects representative synthetic instances while generating a very small number of noisy data. We present extensive experimental results over a large collection of datasets using three different classifiers to demonstrate the advantages of our method.
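
The abstract does not spell out the game itself, so the following is only a minimal sketch of how "all the data are players" could be turned into a filter, under assumptions that are not taken from the paper: each class is a pure strategy, the payoff between two players is a Gaussian similarity when both pick the same class (zero otherwise), original points keep their known label, and synthetic points update a mixed strategy with discrete-time replicator dynamics. The names `gaussian_similarity` and `filter_synthetic`, and the parameters `sigma` and `n_iter`, are hypothetical.

```python
import math


def gaussian_similarity(a, b, sigma=1.0):
    """Similarity between two points: exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def filter_synthetic(originals, labels, synthetic, n_classes=2,
                     minority=1, sigma=1.0, n_iter=50):
    """Return the synthetic points whose final mixed strategy favors `minority`.

    Assumed game (illustrative, not the paper's definition): every point is a
    player, every class is a pure strategy, and two players earn their
    similarity when they choose the same class and nothing otherwise.
    """
    points = list(originals) + list(synthetic)
    n, n_orig = len(points), len(originals)

    # Pairwise similarities; a player does not play against itself.
    w = [[0.0 if i == j else gaussian_similarity(points[i], points[j], sigma)
          for j in range(n)] for i in range(n)]

    # Original points commit to their known class; synthetic points start
    # from the uniform mixed strategy.
    x = [[1.0 if c == labels[i] else 0.0 for c in range(n_classes)]
         for i in range(n_orig)]
    x += [[1.0 / n_classes] * n_classes for _ in synthetic]

    for _ in range(n_iter):
        new_x = [row[:] for row in x]
        for i in range(n_orig, n):  # only synthetic players update
            # Expected payoff of each pure strategy against all other players.
            payoff = [sum(w[i][j] * x[j][c] for j in range(n))
                      for c in range(n_classes)]
            avg = sum(x[i][c] * payoff[c] for c in range(n_classes))
            if avg > 0.0:
                # Discrete-time replicator dynamics update.
                new_x[i] = [x[i][c] * payoff[c] / avg
                            for c in range(n_classes)]
        x = new_x

    # Keep a synthetic point only if its most probable class is the minority.
    return [p for p, s in zip(synthetic, x[n_orig:])
            if max(range(n_classes), key=lambda c: s[c]) == minority]
```

On a toy set with a majority cluster around the origin and a minority cluster around (5, 5), a synthetic point generated inside the minority cluster is kept, while one that lands inside the majority cluster is filtered out as noise. Updating only the synthetic players reflects the stated goal of labeling the synthetic data consistently with the original data; whether the paper fixes the original labels in this way is an assumption here.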

Acknowledgments

This research was funded in part by the US Department of Education (P200A070377 and P200A100119), with cost sharing provided by the University of Houston (UH), and in part by the UH Hugh Roy and Lillie Cranz Cullen Endowment Fund.

Author information

Corresponding author

Correspondence to I. A. Kakadiaris.

Additional information

Communicated by V. Loia.

Appendix

This appendix provides seven tables with the detailed results of the experimental analysis carried out in the present work. Table 12 contains the AUC values achieved by all algorithms on all datasets when using the C4.5 classifier, Table 13 presents the corresponding results for the random forest classifier, and Table 14 those for the SVM classifier. Tables 15, 16 and 17 contain the AUC values on all high-dimensional datasets for the same three classifiers. The best results are highlighted in boldface. Table 18 contains the full description of the datasets used in the experimental analysis.

Table 12 AUC results for C4.5 classifier
Table 13 AUC results for random forest classifier
Table 14 AUC results for SVM classifier
Table 15 AUC results on high-dimensional data for C4.5 classifier
Table 16 AUC results on high-dimensional data for random forest classifier
Table 17 AUC results on high-dimensional data for SVM classifier
Table 18 Description of datasets used for the experimental analysis

Cite this article

Almogahed, B.A., Kakadiaris, I.A. NEATER: filtering of over-sampled data using non-cooperative game theory. Soft Comput 19, 3301–3322 (2015). https://doi.org/10.1007/s00500-014-1484-5

  • DOI: https://doi.org/10.1007/s00500-014-1484-5
