
HCAB-SMOTE: A Hybrid Clustered Affinitive Borderline SMOTE Approach for Imbalanced Data Binary Classification

  • Research Article: Computer Engineering and Computer Science
  • Published in: Arabian Journal for Science and Engineering

Abstract

A binary dataset is considered imbalanced when one of its two classes (the minority class) accounts for less than 40% of the data instances. Standard classification algorithms are biased when applied to imbalanced binary datasets, as they tend to misclassify instances of the minority class. Many techniques have been proposed to reduce this bias and increase classification accuracy. The Synthetic Minority Oversampling Technique (SMOTE) is a well-known approach that addresses the problem by generating new synthetic minority instances to balance the dataset. Unfortunately, it generates these instances randomly, producing many useless instances and wasting time and memory. Several SMOTE derivatives (such as Borderline SMOTE) were proposed to overcome this drawback, yet they reduce the number of generated instances only slightly. This paper therefore proposes a novel approach for generating synthetic instances, called Hybrid Clustered Affinitive Borderline SMOTE (HCAB-SMOTE), which minimizes the number of generated instances while increasing classification accuracy. It combines undersampling, which removes noisy majority instances, with oversampling that increases the density of the borderline region. It applies k-means clustering to the borderline area and identifies which clusters to oversample for the best results. Experimental results show that HCAB-SMOTE outperformed SMOTE, Borderline SMOTE, and the intermediate AB-SMOTE and CAB-SMOTE approaches developed on the way to HCAB-SMOTE, providing the highest classification accuracy with the fewest generated instances.
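
To make the pipeline described in the abstract concrete, the Python sketch below approximates its three named steps: removing majority noise, detecting borderline minority instances, and oversampling only selected k-means clusters of the borderline region. It is an illustrative sketch only, not the authors' implementation: the noise-removal rule, the borderline test, the cluster-selection criterion, and all names and parameters (hcab_smote_sketch, k, n_clusters, n_synthetic) are assumptions made for this example.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def hcab_smote_sketch(X, y, minority_label=1, k=5, n_clusters=3,
                      n_synthetic=100, random_state=0):
    """Illustrative sketch of a cluster-aware borderline oversampler.

    X, y are NumPy arrays; minority_label marks the minority class.
    """
    rng = np.random.default_rng(random_state)

    # Label the k nearest neighbours of every instance.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)          # column 0 is the point itself
    neigh_labels = y[idx[:, 1:]]

    # Step 1 (undersampling): drop majority instances whose neighbourhood is
    # entirely minority -- treated here as majority "noise".
    majority = y != minority_label
    noise = majority & (neigh_labels == minority_label).all(axis=1)
    keep = ~noise

    # Step 2 (borderline detection): minority instances with a mixed
    # neighbourhood (some, but not all, neighbours belong to the majority).
    n_maj = (neigh_labels != minority_label).sum(axis=1)
    borderline = (y == minority_label) & (n_maj >= 1) & (n_maj < k)
    X_border = X[borderline]

    # Step 3 (clustering): cluster the borderline region and pick the larger
    # clusters as oversampling targets (an assumed selection criterion).
    km = KMeans(n_clusters=min(n_clusters, len(X_border)), n_init=10,
                random_state=random_state)
    labels = km.fit_predict(X_border)
    sizes = np.bincount(labels, minlength=km.n_clusters)
    chosen = np.argsort(sizes)[::-1][:max(1, km.n_clusters // 2)]

    # Step 4 (oversampling): SMOTE-style interpolation within chosen clusters.
    synthetic = []
    for c in chosen:
        pts = X_border[labels == c]
        if len(pts) < 2:
            continue
        for _ in range(n_synthetic // len(chosen)):
            a, b = pts[rng.choice(len(pts), size=2, replace=False)]
            synthetic.append(a + rng.random() * (b - a))

    X_syn = np.array(synthetic).reshape(-1, X.shape[1])
    X_new = np.vstack([X[keep], X_syn])
    y_new = np.concatenate([y[keep], np.full(len(X_syn), minority_label)])
    return X_new, y_new

The design choice the sketch highlights is that interpolation is restricted to selected borderline clusters rather than applied across the whole minority class, which is how the abstract explains generating fewer synthetic instances while improving accuracy.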


Author information

Corresponding author

Correspondence to Hisham Al Majzoub.

About this article

Cite this article

Al Majzoub, H., Elgedawy, I., Akaydın, Ö. et al. HCAB-SMOTE: A Hybrid Clustered Affinitive Borderline SMOTE Approach for Imbalanced Data Binary Classification. Arab J Sci Eng 45, 3205–3222 (2020). https://doi.org/10.1007/s13369-019-04336-1

