DOI: 10.1145/3603287.3651191

Prediction Performance Analysis for ML Models Based on Impacts of Data Imbalance and Bias

Published: 27 April 2024

ABSTRACT

An imbalanced dataset is characterized by a substantial disparity in the distribution of examples among its classes, with one class containing significantly more instances than the others. Most credit card fraud datasets are imbalanced. Addressing the challenges posed by imbalanced datasets in classification problems is a complex task, as many classification algorithms struggle to perform satisfactorily under such conditions. In this article, we conduct a comparative analysis of various classifiers to assess their performance on imbalanced credit card fraud data. We then employ the Synthetic Minority Oversampling Technique (SMOTE) to convert the imbalanced data into a relatively balanced dataset and reevaluate the classification results with the same classifiers. Our findings reveal that the Naive Bayes classifier is the least sensitive to dataset imbalance, with an AUC score increase rate of 40.19%, while the KNN classifier is the most sensitive, with an AUC score increase rate of 61.27%. Overall, AdaBoost and Random Forest achieve much higher AUC scores than the other classifiers, both above 95% after applying SMOTE.
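To make the evaluation pipeline concrete, below is a minimal sketch of the before-and-after-SMOTE comparison the abstract describes, using imbalanced-learn's SMOTE with scikit-learn classifiers. The synthetic dataset, 1% fraud rate, 70/30 split, and default hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: compare classifier AUC on imbalanced data before and
# after SMOTE. Assumes scikit-learn and imbalanced-learn are installed;
# the synthetic dataset stands in for a real credit card fraud dataset.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: roughly 1% of examples in the minority (fraud) class.
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Oversample the training split only, so evaluation still happens on
# the untouched, imbalanced test distribution.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

for name, clf in classifiers.items():
    auc_raw = roc_auc_score(
        y_test, clf.fit(X_train, y_train).predict_proba(X_test)[:, 1])
    auc_bal = roc_auc_score(
        y_test, clf.fit(X_bal, y_bal).predict_proba(X_test)[:, 1])
    print(f"{name}: AUC {auc_raw:.3f} -> {auc_bal:.3f} "
          f"({100 * (auc_bal - auc_raw) / auc_raw:+.2f}% change)")
```

Applying SMOTE only to the training split matters: oversampling before the split would place synthetic neighbors of test points into the training set and inflate the measured AUC.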


Published in

ACM SE '24: Proceedings of the 2024 ACM Southeast Conference
April 2024 · 337 pages
ISBN: 9798400702372
DOI: 10.1145/3603287

Copyright © 2024 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States



Qualifiers

• short-paper
• Research
• Refereed limited

Acceptance Rates

ACM SE '24 paper acceptance rate: 44 of 137 submissions, 32%
Overall acceptance rate: 178 of 377 submissions, 47%
