
Data complexity measures for classification of a multi-concept dataset

  • Published:
Multimedia Tools and Applications

Abstract

Classification algorithms build predictive models that assign data to one of a set of predefined categories; the data may be text, images, audio, video, or animation. Data Complexity Metrics (DCMs) give insight into characteristics of a dataset such as its distribution, noise, class overlap, and separability. Existing data complexity metrics treat linearly separable datasets as less complex. However, for a separable multi-concept dataset these metrics fail to produce low complexity values, even though most state-of-the-art classification models achieve high accuracy on such data. The erroneous complexity estimates produced by existing metrics make it difficult to choose an efficient classifier or to fine-tune its parameter settings. This work addresses this complexity-accuracy discrepancy by formulating novel data complexity metrics for both multi-concept and simple datasets. A density-based clustering algorithm (OPTICS) is first used to identify the concepts within each class; these concepts are then used to formulate the DCMs for multi-concept datasets. The work also explores the relationship between the metrics and classifiers on multi-concept datasets: classifier accuracy at different data complexity levels is studied in order to select a highly accurate, robust classifier for a given dataset. The proposed technique is evaluated comprehensively on synthetic datasets covering different data distributions with varying degrees of overlap.
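The abstract's core preprocessing step — running density-based clustering over each class to identify its concepts — can be sketched as follows. This is a toy illustration, not the paper's method: it uses scikit-learn's OPTICS implementation with DBSCAN-style cluster extraction on a synthetic two-blob class, and the `min_samples` and `eps` values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)

# A synthetic "multi-concept" class: samples drawn from two well-separated
# Gaussian blobs (two concepts). A single global separability score would
# overestimate the complexity of such a class.
class_a = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2)),  # concept 1
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(100, 2)),  # concept 2
])

# Density-based clustering recovers the concepts within the class.
# Points labelled -1 are treated as noise by OPTICS.
optics = OPTICS(min_samples=10, cluster_method="dbscan", eps=1.0)
concepts = optics.fit(class_a).labels_
n_concepts = len(set(concepts) - {-1})
print(n_concepts)
```

In the paper's setting, the per-class concept labels found this way would then feed into the multi-concept complexity metrics; here they simply show that OPTICS recovers the two concepts in this well-separated toy class.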


Data Availability

The datasets were generated synthetically, and the data-generating code can be made available on request.


Author information

Corresponding author: Sowkarthika B.

Ethics declarations

Funding/Conflict of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

B, S., Gyanchandani, M., Wadhvani, R. et al. Data complexity measures for classification of a multi-concept dataset. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-18965-8
