DOI: 10.1145/3603287.3651191

Prediction Performance Analysis for ML Models Based on Impacts of Data Imbalance and Bias

Published: 27 April 2024

ABSTRACT

An imbalanced dataset is characterized by a substantial disparity in the distribution of examples among its classes, with one class containing significantly more instances than the others. Most credit card fraud datasets are imbalanced. Addressing the challenges posed by imbalanced datasets in classification problems is a complex task, as many classification algorithms struggle to perform satisfactorily under such conditions. In this article, we conduct a comparative analysis of various classifiers to assess their performance on imbalanced credit card fraud data. We then employ the Synthetic Minority Oversampling Technique (SMOTE) to convert the imbalanced data into a relatively balanced dataset and reevaluate the classification results with the same classifiers. Our findings reveal that the Naive Bayes classifier is the least sensitive to dataset imbalance, with an AUC score increase rate of 40.19%, while the KNN classifier is the most sensitive, with an AUC score increase rate of 61.27%. Overall, AdaBoost and Random Forest achieve much higher AUC scores than the other classifiers, both above 95% after applying SMOTE.
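To make the evaluation pipeline concrete, below is a minimal sketch of the before-and-after-SMOTE comparison the abstract describes, using imbalanced-learn's SMOTE with scikit-learn classifiers. The synthetic dataset, 1% fraud rate, 70/30 split, and default hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: compare classifier AUC on imbalanced data before and
# after SMOTE. Assumes scikit-learn and imbalanced-learn are installed;
# the synthetic dataset stands in for a real credit card fraud dataset.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: roughly 1% of examples in the minority (fraud) class.
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Oversample the training split only, so evaluation still happens on
# the untouched, imbalanced test distribution.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

for name, clf in classifiers.items():
    auc_raw = roc_auc_score(
        y_test, clf.fit(X_train, y_train).predict_proba(X_test)[:, 1])
    auc_bal = roc_auc_score(
        y_test, clf.fit(X_bal, y_bal).predict_proba(X_test)[:, 1])
    print(f"{name}: AUC {auc_raw:.3f} -> {auc_bal:.3f} "
          f"({100 * (auc_bal - auc_raw) / auc_raw:+.2f}% change)")
```

Applying SMOTE only to the training split matters: oversampling before the split would place synthetic neighbors of test points into the training set and inflate the measured AUC.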


Published in

ACM SE '24: Proceedings of the 2024 ACM Southeast Conference
April 2024 · 337 pages
ISBN: 9798400702372
DOI: 10.1145/3603287

Copyright © 2024 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States



Qualifiers

• short-paper
• Research
• Refereed limited

Acceptance Rates

ACM SE '24 paper acceptance rate: 44 of 137 submissions, 32%
Overall acceptance rate: 178 of 377 submissions, 47%
