Abstract
Imbalanced data is an important topic that has long vexed researchers working on classification problems. A dataset is imbalanced if its classes are not approximately equally represented. The class with few samples is called the minority class, while the other classes form the majority class. Standard classifiers tend to misclassify minority samples because they assume that the class distribution is relatively balanced. However, in real-world applications, correctly predicting minority samples is more valuable than correctly classifying samples belonging to the majority class. In this paper, we propose GNOT, a novel oversampling strategy that combines genetic algorithm concepts with a novelty detection technique to generate new samples that are consistent with the original distribution of the minority class while avoiding outliers. We tested GNOT on seven real-world imbalanced datasets. Our experimental analysis shows that GNOT can effectively improve classifier performance in terms of G-mean and F1-measure.
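To illustrate the two evaluation measures named in the abstract, the following is a minimal sketch, assuming the common binary definitions: G-mean as the geometric mean of sensitivity and specificity, and F1-measure as the harmonic mean of precision and recall, with the minority class treated as the positive class. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def f1_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall on the positive class."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: 2 minority samples (label 1) and 8 majority samples (label 0).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A classifier that always predicts the majority class reaches 80% accuracy
# on this data, yet both measures are zero.
always_majority = [0] * len(y_true)
print(g_mean(y_true, always_majority), f1_measure(y_true, always_majority))
```

Under these assumed definitions, both measures fall to zero when no minority sample is recognised, which is why they are preferred over plain accuracy for imbalanced problems.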
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Ait Addi, H., Ezzahir, R., Boukhlik, N. (2023). Genetic-Novelty Oversampling Technique for Imbalanced Data. In: Lazaar, M., En-Naimi, E.M., Zouhair, A., Al Achhab, M., Mahboub, O. (eds) Proceedings of the 6th International Conference on Big Data and Internet of Things. BDIoT 2022. Lecture Notes in Networks and Systems, vol 625. Springer, Cham. https://doi.org/10.1007/978-3-031-28387-1_16
DOI: https://doi.org/10.1007/978-3-031-28387-1_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-28386-4
Online ISBN: 978-3-031-28387-1
eBook Packages: Intelligent Technologies and Robotics (R0)