Skip to main content

Genetic-Novelty Oversampling Technique for Imbalanced Data

  • Conference paper
  • First Online:
Proceedings of the 6th International Conference on Big Data and Internet of Things (BDIoT 2022)

Part of the book series: Lecture Notes in Networks and Systems ((LNNS,volume 625))

Included in the following conference series:

  • 253 Accesses

Abstract

Imbalance data is in important topic vexed researchers in practice of classification problems. A data is imbalanced if the distributions of categories are not approximately equally represented. The class with small samples is called minority class, while the other classes form the majority class. Standard learning classifiers tend to misclassify the minority samples; they assume that the distribution of data is relatively balanced. However in real world application, the corrected prediction of minority samples is more valuable than correctly classify samples belonging to the majority class. In this paper, we propose GNOT a novel oversampling strategy that combines algorithm genetic concept and novelty detection technique to generate consistent with the original distribution of the minority class while avoiding outliers. We tested GNOT on seven real-world imbalanced datasets. Our experimental analysis shows that GNOT can effectively improve the performance of classifiers in terms of G-mean and F1-measure.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 229.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 299.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Alcalá-Fdez, J., et al.: KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft. Comput. 13, 307–318 (2008). https://doi.org/10.1007/s00500-008-0323-y

    Article  Google Scholar 

  2. At, E., Aljourf, M., Al-Mohanna, F., Shoukri, M.R.: Classification of imbalance data using Tomek link(T-Link) combined with random under-sampling (RUS) as a data reduction method (2016)

    Google Scholar 

  3. Baatar, N., Zhang, D., Koh, C.: An improved differential evolution algorithm adopting \(\lambda \) -best mutation strategy for global optimization of electromagnetic devices. IEEE Trans. Magn. 49(5), 2097–2100 (2013)

    Article  Google Scholar 

  4. Bernard, T., Nakib, A.: Adaptive ECG signal filtering using Bayesian based evolutionary algorithm. In: Metaheuristics for Medicine and Biology, pp. 187–211 (2017). https://doi.org/10.1007/978-3-662-54428-0_11

  5. Breunig, M.M., Kriegel, H., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 16–18 May 2000, Dallas, Texas, USA, pp. 93–104 (2000)

    Google Scholar 

  6. Cervantes, J., Li, X., Yu, W.: Using genetic algorithm to improve classification accuracy on imbalanced data. In: 2013 IEEE International Conference on Systems, Man, and Cybernetics, pp. 2659–2664, October 2013

    Google Scholar 

  7. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)

    Article  MATH  Google Scholar 

  8. Desforges, M.J., Jacob, P.J., Ball, A.D.: Fault detection in rotating machinery using kernel-based probability density estimation. Int. J. Syst. Sci. 31(11), 1411–1426 (2000)

    Article  MATH  Google Scholar 

  9. Dua, D., Graff, C.: UCI machine learning repository (2017). http://archive.ics.uci.edu/ml

  10. Estabrooks, A., Jo, T., Japkowicz, N.: A multiple resampling method for learning from imbalanced data sets. Comput. Intell. 20, 18–36 (2004)

    Article  MathSciNet  Google Scholar 

  11. Guan, D., Yuan, W., Lee, Y., Lee, S.: Nearest neighbor editing aided by unlabeled data. Inf. Sci. 179(13), 2273–2282 (2009)

    Article  Google Scholar 

  12. Han, H., Wang, W.-Y., Mao, B.-H.: Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang, D.-S., Zhang, X.-P., Huang, G.-B. (eds.) ICIC 2005. LNCS, vol. 3644, pp. 878–887. Springer, Heidelberg (2005). https://doi.org/10.1007/11538059_91

    Chapter  Google Scholar 

  13. Hawkins, D.M.: Identification of Outliers. Monographs on Applied Probability and Statistics, Springer, Cham (1980). https://doi.org/10.1007/978-94-015-3994-4

    Book  MATH  Google Scholar 

  14. Jiang, K., Lu, J., Xia, K.: A novel algorithm for imbalance data classification based on genetic algorithm improved smote. Arab. J. Sci. Eng. 41, 3255–3266 (2016)

    Article  Google Scholar 

  15. Karia, V., Zhang, W., Naeim, A., Ramezani, R.: Gensample: a genetic algorithm for oversampling in imbalanced datasets. CoRR abs/1910.10806 (2019)

    Google Scholar 

  16. Kotsiantis, S.B., Kanellopoulos, D., Pintelas, P.E.: Handling imbalanced datasets: a review (2006)

    Google Scholar 

  17. Kubat, M., Holte, R.C., Matwin, S.: Machine learning for the detection of oil spills in satellite radar images. Mach. Learn. 30, 195–215 (1998)

    Article  Google Scholar 

  18. Laza, R., Pavón, R., Reboiro-Jato, M., Fdez-Riverola, F.: Evaluating the effect of unbalanced data in biomedical document classification. J. Integr. Bioinform. 8(3), 105–117 (2011)

    Article  Google Scholar 

  19. Lemaître, G., Nogueira, F., Aridas, C.K.: Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 18(17), 1–5 (2017). http://jmlr.org/papers/v18/16-365.html

  20. Li, Y., Guo, H., Zhang, Q., Mingyun, G., Yang, J.: Imbalanced text sentiment classification using universal and domain-specific knowledge. Knowl.-Based Syst. 160, 1–15 (2018)

    Article  Google Scholar 

  21. Markou, M., Singh, S.: Novelty detection: a review - part 1: statistical approaches. Sig. Process. 83(12), 2481–2497 (2003)

    Article  MATH  Google Scholar 

  22. Mena, L.J., Gonzalez, J.A.: Machine learning for imbalanced datasets: application in medical diagnostic. In: FLAIRS Conference (2006)

    Google Scholar 

  23. Miljkovic, D.: Review of novelty detection methods. In: The 33rd International Convention MIPRO, pp. 593–598, May 2010

    Google Scholar 

  24. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

    Google Scholar 

  25. Rout, N., Mishra, D., Mallick, M.K.: Handling imbalanced data: a survey. In: Reddy, M.S., Viswanath, K., K.M., S.P. (eds.) International Proceedings on Advances in Soft Computing, Intelligent Systems and Applications. AISC, vol. 628, pp. 431–443. Springer, Singapore (2018). https://doi.org/10.1007/978-981-10-5272-9_39

    Chapter  Google Scholar 

  26. Phua, C., Alahakoon, D., Lee, V.C.S.: Minority report in fraud detection: classification of skewed data. SIGKDD Explor. 6, 50–59 (2004)

    Article  Google Scholar 

  27. Saladi, P.S.M., Dash, T.: Genetic algorithm-based oversampling technique to learn from imbalanced data. In: Bansal, J.C., Das, K.N., Nagar, A., Deep, K., Ojha, A.K. (eds.) Soft Computing for Problem Solving, pp. 387–397. Springer Singapore, Singapore (2019). https://doi.org/10.1007/978-981-13-1592-3_30

    Chapter  Google Scholar 

  28. Tomasev, N., Mladenic, D.: Class imbalance and the curse of minority hubs. Knowl.-Based Syst. 53, 157–172 (2013)

    Article  Google Scholar 

  29. V., C.N.: Data mining for imbalanced datasets: an overview. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook. Springer, Boston, pp. 853–867. Springer, Boston (2005). https://doi.org/10.1007/978-0-387-09823-4_45

  30. VALUATIONS, E.: A review on evaluation metrics for data classification evaluations (2015)

    Google Scholar 

  31. Wilson, D.L.: Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 2(3), 408–421 (1972)

    Article  MathSciNet  MATH  Google Scholar 

  32. Wright, A.H.: Genetic algorithms for real parameter optimization. In: Proceedings of the First Workshop on Foundations of Genetic Algorithms. Bloomington Campus, Indiana, USA, 15–18 July 1990, pp. 205–218 (1990)

    Google Scholar 

  33. Zewdu, T., HiLCoE, T.B.: Prediction of HIV status in Addis Ababa using data mining technology (2015)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Hajar Ait Addi .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Ait Addi, H., Ezzahir, R., Boukhlik, N. (2023). Genetic-Novelty Oversampling Technique for Imbalanced Data. In: Lazaar, M., En-Naimi, E.M., Zouhair, A., Al Achhab, M., Mahboub, O. (eds) Proceedings of the 6th International Conference on Big Data and Internet of Things. BDIoT 2022. Lecture Notes in Networks and Systems, vol 625. Springer, Cham. https://doi.org/10.1007/978-3-031-28387-1_16

Download citation

Publish with us

Policies and ethics