Ensembling validation indices to estimate the optimal number of clusters

Applied Intelligence

Abstract

In unsupervised learning, one of the most important and challenging tasks is estimating the optimal number of clusters (NC) for a given dataset. Identifying NC is an essential criterion of cluster validity in clustering analysis. The purpose of cluster analysis is to group data points with similar characteristics, which helps reveal the distributions and correlations of patterns in large datasets. Recently, the availability and diversity of large volumes of data have motivated researchers to identify the optimal NC in such data. This paper proposes an ensemble approach, called the Ensemble Cluster Validity Index (ECVI), that determines the optimal NC by integrating and optimising several clustering validity indices, namely the Silhouette (Sil) index, the Davies–Bouldin (DB) index, the Calinski–Harabasz (CH) index, and the Gap statistic. The proposed ECVI aims to improve the selection of a proper NC, which can serve as a measure of how correctly a partitioning represents the actual structure of the dataset. The NC selected by ECVI is used as the input parameter for the k-means clustering algorithm. In other words, the proposed ECVI focuses on developing and validating an internal validity method for identifying a suitable NC. An experimental comparison against the ground-truth labels of datasets collected from the UCI repository shows that the proposed ECVI produces promising results when finding the optimal NC in such datasets. ECVI evaluates the clustering results obtained using a specific algorithm (e.g., k-means or affinity propagation) and identifies the optimal NC for twenty-two UCI datasets. The effectiveness of the proposed ECVI is illustrated by theoretical analysis and then demonstrated by extensive experiments.
ECVI was compared with fifteen recently published and state-of-the-art validity indices: DB, Sil, CH, Gap, STR, EM with STR, K-means with STR, KL, Hart, Wint, IGP, Dunn, BWC, PBM, and SC. The experimental results show that ECVI surpasses all the compared indices in terms of the optimal NC and accuracy rate.
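
The abstract does not say how ECVI integrates the individual indices, so the following is only a minimal sketch of the general idea of ensembling validity indices, not the authors' method. It assumes a plain majority vote: each index nominates the k it scores best, and the most frequent nomination wins. The function names, the tie-break rule, and the choice of scikit-learn are assumptions made for illustration, and the Gap statistic is left out to keep the sketch short.

```python
from collections import Counter

# Direction of each index: True means larger is better, False means smaller is better.
HIGHER_IS_BETTER = {"Sil": True, "CH": True, "DB": False}


def score_partitions(X, k_values, seed=0):
    """Run k-means for each candidate k and score the partition with Sil, CH and DB."""
    # Imported here so the voting helper below stays usable without scikit-learn.
    from sklearn.cluster import KMeans
    from sklearn.metrics import (
        calinski_harabasz_score,
        davies_bouldin_score,
        silhouette_score,
    )

    scores = {name: {} for name in HIGHER_IS_BETTER}
    for k in k_values:  # all three indices require k >= 2
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores["Sil"][k] = silhouette_score(X, labels)
        scores["CH"][k] = calinski_harabasz_score(X, labels)
        scores["DB"][k] = davies_bouldin_score(X, labels)
    return scores


def vote_number_of_clusters(scores):
    """Let each index nominate its best k, then return the most frequent nomination."""
    nominations = {}
    for name, by_k in scores.items():
        pick = max if HIGHER_IS_BETTER[name] else min
        nominations[name] = pick(by_k, key=by_k.get)
    counts = Counter(nominations.values())
    top = max(counts.values())
    # Assumed tie-break: prefer the smallest k among the most-voted candidates.
    best_k = min(k for k, c in counts.items() if c == top)
    return best_k, nominations
```

With scikit-learn installed, calling vote_number_of_clusters(score_partitions(X, range(2, 8))) on a feature matrix X returns the voted NC together with each index's own nomination, and the voted NC can then be passed to k-means as its n_clusters parameter.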

Acknowledgments

This work is supported by the Deanship of Scientific Research and Graduate Studies at the University of Petra, Amman, Jordan.

Author information

Corresponding author

Correspondence to Bilal Sowan.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sowan, B., Hong, TP., Al-Qerem, A. et al. Ensembling validation indices to estimate the optimal number of clusters. Appl Intell 53, 9933–9957 (2023). https://doi.org/10.1007/s10489-022-03939-w

  • DOI: https://doi.org/10.1007/s10489-022-03939-w
