Abstract
Classification algorithms build predictive models that assign data to one of a set of predefined categories; the data may be text, images, audio, video, or animation. Data Complexity Metrics (DCM) give insight into characteristics of a dataset such as its distribution, noise, class overlap, and separability. Existing data complexity metrics treat linearly separable datasets as less complex. However, for a separable multi-concept dataset, these metrics fail to produce low complexity values even though most state-of-the-art classification models achieve high accuracy on it. The erroneous complexity estimates produced by the existing metrics make it difficult to choose an efficient classifier model or fine-tune its parameter settings. This work addresses this complexity-accuracy discrepancy by formulating novel data complexity metrics for both multi-concept and simple datasets. A density-based clustering algorithm (OPTICS) is first used to identify the concepts within each class; these concepts are then used in the formulation of the DCM for multi-concept datasets. This work also explores the relationship between the metrics and classifiers on multi-concept datasets: classifier accuracy is studied across data complexity levels to select a highly accurate, robust classifier for a given dataset. The proposed technique is evaluated comprehensively on synthetic datasets covering different data distributions with varying degrees of overlap.
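The first step the abstract describes, identifying the "concepts" (dense sub-groups) within a single class before computing complexity metrics, can be illustrated with a minimal sketch. The paper uses OPTICS for this; the toy function below instead uses a simple distance-threshold connected-components grouping as a stand-in, purely to show the idea of counting concepts per class. The function name, the `eps` threshold, and the sample points are illustrative assumptions, not taken from the paper.

```python
import math

def concept_count(points, eps=2.0):
    """Count dense sub-groups ("concepts") in one class: points chained by
    pairwise distances below eps form one connected component, and each
    component approximates one concept. A simplified stand-in for OPTICS."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

# One class containing two well-separated concepts
class_a = [(0.0, 0.0), (0.5, 0.3), (1.0, 0.8),     # concept 1
           (10.0, 10.0), (10.4, 9.6), (9.7, 10.3)]  # concept 2
print(concept_count(class_a))  # → 2
```

A metric that looks only at linear separability between classes would not see this internal structure, which is why the per-class concept count feeds into the proposed multi-concept metrics.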
Data Availability
Datasets were generated synthetically and the data-generating code can be made available on request.
Ethics declarations
Funding/Conflict of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
B, S., Gyanchandani, M., Wadhvani, R. et al. Data complexity measures for classification of a multi-concept dataset. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-18965-8