ABSTRACT
An imbalanced dataset is characterized by a substantial disparity in the distribution of examples among its classes, with one class containing significantly more instances than the others. Most credit card fraud datasets are imbalanced. Addressing the challenges posed by imbalanced datasets in classification problems is a complex task, as many classification algorithms struggle to deliver satisfactory performance under such conditions. In this article, we conduct a comparative analysis of various classifiers to assess their performance on imbalanced credit card fraud data. We then employ the Synthetic Minority Oversampling Technique (SMOTE) to transform the imbalanced data into a relatively balanced dataset and re-evaluate the classification results with the same classifiers. Our findings reveal that the Naive Bayes classifier is the least sensitive to the class imbalance, with an AUC score increase rate of 40.19%, while the KNN classifier is the most sensitive, with an AUC score increase rate of 61.27%. Overall, AdaBoost and Random Forest achieve the highest AUC scores, both exceeding 95% after applying SMOTE.
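The following is a minimal sketch of the evaluation pipeline described in the abstract, using scikit-learn and imbalanced-learn. The input file name and the target column name (`credit_card.csv`, `default`) are placeholders, not the study's actual artifacts; it assumes a binary fraud/default label and compares each classifier's AUC before and after oversampling the training split with SMOTE.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Placeholder file and column names; substitute the actual credit dataset.
df = pd.read_csv("credit_card.csv")
X, y = df.drop(columns=["default"]), df["default"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Oversample only the training split so the test set keeps the original class ratio.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

classifiers = {
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "Random Forest": RandomForestClassifier(),
}

for name, clf in classifiers.items():
    # AUC when training on the original imbalanced data.
    auc_before = roc_auc_score(
        y_test, clf.fit(X_train, y_train).predict_proba(X_test)[:, 1])
    # AUC when training on the SMOTE-balanced data.
    auc_after = roc_auc_score(
        y_test, clf.fit(X_res, y_res).predict_proba(X_test)[:, 1])
    print(f"{name}: AUC before={auc_before:.3f}, after={auc_after:.3f}, "
          f"increase={(auc_after - auc_before) / auc_before:.1%}")
```

The AUC increase rate reported in the abstract corresponds to the relative change printed in the last line, computed per classifier.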