Abstract
In unsupervised learning tasks, one of the most significant and challenging aspects is how to estimate the optimal number of clusters (NC) for a particular set of data. Identifying NC in a given dataset is an essential criterion of cluster validity in clustering analysis. The purpose of cluster analysis is to group data points of similar characteristics, which helps determine distributions and correlations of patterns in large datasets. Recently, the availability and diversity of vast data have inspired researchers to identify an optimal NC in such data. In this paper, an ensemble approach is proposed called Ensemble Cluster Validity Index ECVI, to determine the optimal NC based on integrating and optimising several clustering validity indices, namely the Silhouette (Sil) index, the Davies–Bouldin (DB) index, the Calinski-Harabasz (CH) index, and the Gap statistic. The proposed ECVI aims to enhance the selection of the proper NC, which can be used as a measure of a dataset’s partitioning correctness to represent the actual structure of the dataset. The clustering solution (outcome) of the proposed ECVI is used as an input parameter for the k-means clustering algorithm. In other words, the proposed ECVI is concentrated to develop and validate an internal validity method in order to identify a suitable NC. The experimental comparison with the ground-truth labels for given datasets collected from the UCI repository demonstrates that the proposed ECVI outperforms and produces promising outcomes when finding the optimal ECVI in such datasets. The ECVI evaluates the clustering results obtained using a specific algorithm (e.g., k-means or affinity propagation) and identifies the optimal NC for twenty-two UCI datasets. The effectiveness of the proposed ECVI is illustrated by the theoretical analysis and then demonstrated by extensive experiments. ECVI was compared to fifteen recently published and state-of-the-art validity indices, including DB, SIL, CH, Gap, STR, EM with STR, K-means with STR, KL, Hart, Wint, IGP, Dunn, BWC, PBM, and SC indices. The experimental results show that ECVI surpasses all the compared indices in terms of the optimal NC and accuracy rate.
Similar content being viewed by others
References
Sowan B (2017) A comparative analysis of exam timetable using data mining techniques. IJCSNS 17(1):73
Renjith S, Sreekumar A, Jathavedan M (2020) Performance evaluation of clustering algorithms for varying cardinality and dimensionality of data sets. Mater Today: Proc, 27
Ghassany M, Grozavu N, Bennani Y (2013) Collaborative multi-view clustering. In: The 2013 international joint conference on neural networks (IJCNN). IEEE, pp 1–8
Khedairia S, Khadir M T (2019) A multiple clustering combination approach based on iterative voting process. Journal of King Saud University-Computer and Information Sciences, 34(1)
Galdi P, Serra A, Tagliaferri R (2016) Rotation clustering: a consensus clustering approach to cluster gene expression data. In: International workshop on fuzzy logic and applications. Springer, pp 229–238
Sowan B I, Dahal K P, Hossain A M, Alam M S (2010) Diversification of fuzzy association rules to improve prediction accuracy. In: International conference on fuzzy systems. IEEE, pp 1–8
Sowan B, Qattous H (2017) A data mining of supervised learning approach based on k-means clustering. Int J Comput Sci Netw Secur 17(1):18–24
Sowan B, Matar N, Omar F, Alauthman M, Eshtay M (2020) Evaluation of class decomposition based on clustering validity and k-means algorithm. In: 2020 21st International arab conference on information technology (ACIT). https://doi.org/10.1109/ACIT50332.2020.9300084, pp 1–6
Lee S -H, Jeong Y -S, Kim J -Y, Jeong M K (2018) A new clustering validity index for arbitrary shape of clusters. Pattern Recogn Lett 112:263–269
Zhou S, Liu F, Song W (2021) Estimating the optimal number of clusters via internal validity index. Neural Process Lett 53(2):1013–1034
Zhou S, Xu Z (2018) A novel internal validity index based on the cluster centre and the nearest neighbour cluster. Appl Soft Comput 71:78–88
Tardioli G, Kerrigan R, Oates M, O’Donnell J, Finn D P (2018) Identification of representative buildings and building groups in urban datasets using a novel pre-processing, classification, clustering and predictive modelling approach. Build Environ 140:90–106
Gupta A, Datta S, Das S (2018) Fast automatic estimation of the number of clusters from the minimum inter-center distance for k-means clustering. Pattern Recogn Lett 116:72–79
Sowan B, Qattous H (2017) A data mining of supervised learning approach based on k-means clustering. Int J Comput Sci Netw Secur 17(1):18
Wu W, Peng M (2017) A data mining approach combining k-means clustering with bagging neural network for short-term wind power forecasting. IEEE Internet Things J 4(4):979– 986
Ashfaq R A R, Wang X -Z, Huang J Z, Abbas H, He Y -L (2017) Fuzziness based semi-supervised learning approach for intrusion detection system. Inf Sci 378:484–497
Fahad A, Alshatri N, Tari Z, Alamri A, Khalil I, Zomaya A Y, Foufou S, Bouras A (2014) A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Top Comput 2(3):267–279
Patil C, Baidari I (2019) Estimating the optimal number of clusters k in a dataset using data depth. Data Sci Eng 4(2):132–140
Malika C, Ghazzali N, Boiteau V, Niknafs A (2014) Nbclust: an r package for determining the relevant number of clusters in a data set. J Stat Softw 61:1–36
Sowan B, Qattous H (2017) A data mining of supervised learning approach based on k-means clustering. Int J Comput Sci Netw Secur 17(1):18
Zhao Q, Fränti P (2014) Wb-index: a sum-of-squares based index for cluster validity. Data Knowl Eng 92:77–89
Akogul S, Erisoglu M (2017) An approach for determining the number of clusters in a model-based cluster analysis. Entropy 19(9):452
Li Q, Yue S, Wang Y, Ding M, Li J (2020) A new cluster validity index based on the adjustment of within-cluster distance. IEEE Access 8:202872–202885
Luna-Romera J M, García-gutiérrez J, Martínez-Ballesteros M, Riquelme Santos JC (2018) An approach to validity indices for clustering techniques in big data. Progr Artif Intell 7(2):81–94
Zhu E, Ma R (2018) An effective partitional clustering algorithm based on new clustering validity index. Appl Soft Comput 71:608–621
Caliński T, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat-Theory Methods 3(1):1–27
Rousseeuw P J (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65
Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc: Ser B (Stat Methodol) 63(2):411–423
Dunn J C (1974) Well-separated clusters and optimal fuzzy partitions. J Cybern 4(1):95–104
Bezdek J C, Pal N R (1995) Cluster validation with generalized dunn’s indices. In: Proceedings 1995 second New Zealand international two-stream conference on artificial neural networks and expert systems. IEEE, pp 190–193
Davies D L, Bouldin D W (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell (2):224–227
Chou C -H, Su M -C, Lai E (2004) A new cluster validity measure and its application to image compression. Pattern Anal Appl 7(2):205–220
Maulik U, Bandyopadhyay S (2002) Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal Mach Intell 24(12):1650–1654
Saha S, Bandyopadhyay S (2009) Performance evaluation of some symmetry-based cluster validity indexes. IEEE Trans Syst Man Cybern Part C (Appl Rev) 39(4):420–425
Dudoit S, Fridlyand J (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol 3(7):1–21
Starczewski A (2017) A new validity index for crisp clusters. Pattern Anal Appl 20(3):687–700
Hartigan J A (1985) Statistical theory in clustering. J Classif 2(1):63–76
Strehl A (2002) Relationship-based clustering and cluster ensembles for high-dimensional data mining. The University of Texas at Austin
Zhou S, Xu Z, Tang X (2011) Comparative study on method for determining optimal number of clusters based on affinity propagation clustering. Comput Sci, 38(2)
Kapp A V, Tibshirani R (2007) Are clusters found in one dataset present in another dataset? Biostatistics 8(1):9–31
Zhao Y, Guo Y, Sun R, Liu Z, Guo D (2020) Unsupervised video summarization via clustering validity index. Multimed Tools Appl 79(45):33417–33430
Pakhira M K, Bandyopadhyay S, Maulik U (2004) Validity index for crisp and fuzzy clusters. Pattern Recognit 37(3):487–501
Xie X L, Beni G (1991) A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 13(8):841–847
Vendramin L, Campello R J, Hruschka E R (2010) Relative clustering validity criteria: a comparative overview. Stat Anal Data Min: The ASA Data Science Journal 3(4):209–235
Capó M, Pérez A, Lozano J A (2020) An efficient k-means clustering algorithm for tall data. Data Min Knowl Disc 1–36
Hancer E, Karaboga D (2017) A comprehensive survey of traditional, merge-split and evolutionary approaches proposed for determination of cluster number. Swarm Evol Comput 32:49– 67
Sharma C, Ojha C (2020) Statistical parameters of hydrometeorological variables: standard deviation, snr, skewness and kurtosis. In: Advances in water resources engineering and management. Springer, pp 59–70
Das P, Das A K (2019) Graph-based clustering of extracted paraphrases for labelling crime reports. Knowl-Based Syst 179:55– 76
Dua D, Graff C (2017) UCI Machine Learning Repository. http://archive.ics.uci.edu/ml. Accessed 1 Sept 2021
Acknowledgments
This work is supported by Deanship of Scientific Research and Graduate Studies at University of Petra, Amman, Jordan.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Sowan, B., Hong, TP., Al-Qerem, A. et al. Ensembling validation indices to estimate the optimal number of clusters. Appl Intell 53, 9933–9957 (2023). https://doi.org/10.1007/s10489-022-03939-w
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10489-022-03939-w