Abstract
Classification algorithms build predictive models that assign data to one of a set of predefined categories; the data may be text, images, audio, video, or animation. Data Complexity Metrics (DCM) give insight into characteristics of a dataset such as its distribution, noise, class overlap, and separability. Existing data complexity metrics treat linearly separable datasets as less complex. However, for a separable multi-concept dataset, these metrics fail to produce low complexity values even though most state-of-the-art classification models achieve high accuracy on it. The erroneous complexity estimates produced by the existing metrics make it difficult to choose an efficient classifier model or fine-tune its parameter settings. This work addresses this complexity-accuracy discrepancy by formulating novel data complexity metrics for both multi-concept and simple datasets. A density-based clustering algorithm (OPTICS) is first used to identify the concepts within each class; these concepts are then used in the formulation of the DCM for multi-concept datasets. This work also explores the relationship between the metrics and classifiers on multi-concept datasets: classifier accuracy is studied across data complexity levels to select a highly accurate, robust classifier for a given dataset. The proposed technique is evaluated comprehensively on synthetic datasets covering different data distributions with varying degrees of overlap.
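The first step the abstract describes, identifying the "concepts" (dense sub-groups) within a single class before computing complexity metrics, can be illustrated with a minimal sketch. The paper uses OPTICS for this; the toy function below instead uses a simple distance-threshold connected-components grouping as a stand-in, purely to show the idea of counting concepts per class. The function name, the `eps` threshold, and the sample points are illustrative assumptions, not taken from the paper.

```python
import math

def concept_count(points, eps=2.0):
    """Count dense sub-groups ("concepts") in one class: points chained by
    pairwise distances below eps form one connected component, and each
    component approximates one concept. A simplified stand-in for OPTICS."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

# One class containing two well-separated concepts
class_a = [(0.0, 0.0), (0.5, 0.3), (1.0, 0.8),     # concept 1
           (10.0, 10.0), (10.4, 9.6), (9.7, 10.3)]  # concept 2
print(concept_count(class_a))  # → 2
```

A metric that looks only at linear separability between classes would not see this internal structure, which is why the per-class concept count feeds into the proposed multi-concept metrics.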
Data Availability
Datasets were generated synthetically and the data-generating code can be made available on request.
Ethics declarations
Funding/Conflict of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
B, S., Gyanchandani, M., Wadhvani, R. et al. Data complexity measures for classification of a multi-concept dataset. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-18965-8