Ensembling validation indices to estimate the optimal number of clusters

Sowan, Bilal; Hong, Tzung-Pei; Al-Qerem, Ahmad; Alauthman, Mohammad; Matar, Nasim

doi:10.1007/s10489-022-03939-w

Published: 13 August 2022

Applied Intelligence, volume 53, pages 9933–9957 (2023)

Bilal Sowan¹ (ORCID: orcid.org/0000-0002-1933-4196),
Tzung-Pei Hong²,³,
Ahmad Al-Qerem⁴,
Mohammad Alauthman⁵ &
Nasim Matar¹

Abstract

In unsupervised learning, one of the most significant and challenging tasks is estimating the optimal number of clusters (NC) for a given dataset. Identifying NC is an essential criterion of cluster validity in clustering analysis. The purpose of cluster analysis is to group data points with similar characteristics, which helps determine the distributions and correlations of patterns in large datasets. Recently, the availability and diversity of vast amounts of data have inspired researchers to identify an optimal NC in such data. In this paper, an ensemble approach called the Ensemble Cluster Validity Index (ECVI) is proposed to determine the optimal NC by integrating and optimising several clustering validity indices, namely the Silhouette (Sil) index, the Davies–Bouldin (DB) index, the Calinski–Harabasz (CH) index, and the Gap statistic. The proposed ECVI aims to improve the selection of the proper NC, which can serve as a measure of how correctly a partitioning represents the actual structure of the dataset. The clustering solution (outcome) of the proposed ECVI is used as an input parameter for the k-means clustering algorithm. In other words, the proposed ECVI focuses on developing and validating an internal validity method to identify a suitable NC. Experimental comparison with the ground-truth labels of datasets collected from the UCI repository demonstrates that the proposed ECVI produces promising outcomes when finding the optimal NC in such datasets. The ECVI evaluates the clustering results obtained using a specific algorithm (e.g., k-means or affinity propagation) and identifies the optimal NC for twenty-two UCI datasets. The effectiveness of the proposed ECVI is illustrated by theoretical analysis and then demonstrated by extensive experiments.
ECVI was compared with fifteen recently published and state-of-the-art validity indices: DB, Sil, CH, Gap, STR, EM with STR, K-means with STR, KL, Hart, Wint, IGP, Dunn, BWC, PBM, and SC. The experimental results show that ECVI surpasses all the compared indices in terms of the optimal NC and accuracy rate.
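
The general idea of combining several internal validity indices to pick NC can be illustrated with a small sketch. The abstract does not specify how ECVI integrates and optimises its component indices, so the code below is not the authors' method: it assumes a plain majority vote over three of the four indices (Sil, CH, DB), omits the Gap statistic for brevity, and uses a deterministic farthest-first initialisation for k-means so the run is reproducible. The function names (`kmeans`, `vote_nc`, and so on) and the toy data are hypothetical.

```python
import math
from collections import Counter

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans(X, k, iters=100):
    # Assumption: deterministic farthest-first initialisation instead of random seeding.
    centers = [X[0]]
    while len(centers) < k:
        centers.append(max(X, key=lambda x: min(math.dist(x, c) for c in centers)))
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(x, centers[j])) for x in X]
        groups = [[x for x, l in zip(X, labels) if l == j] for j in range(k)]
        new = [centroid(g) if g else centers[j] for j, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    return [g for g in groups if g]  # clusters as lists of points, empties dropped

def silhouette(clusters):
    # Mean silhouette width; higher is better. Singletons score 0 by convention.
    scores = []
    for i, ci in enumerate(clusters):
        for x in ci:
            if len(ci) == 1:
                scores.append(0.0)
                continue
            a = sum(math.dist(x, y) for y in ci) / (len(ci) - 1)
            b = min(sum(math.dist(x, y) for y in cj) / len(cj)
                    for j, cj in enumerate(clusters) if j != i)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def calinski_harabasz(clusters):
    # Ratio of between-cluster to within-cluster dispersion; higher is better.
    allpts = [x for c in clusters for x in c]
    g, k, n = centroid(allpts), len(clusters), len(allpts)
    between = sum(len(c) * math.dist(centroid(c), g) ** 2 for c in clusters)
    within = sum(math.dist(x, centroid(c)) ** 2 for c in clusters for x in c)
    return (between / (k - 1)) / (within / (n - k))

def davies_bouldin(clusters):
    # Average worst-case similarity between clusters; lower is better.
    cents = [centroid(c) for c in clusters]
    spread = [sum(math.dist(x, m) for x in c) / len(c) for c, m in zip(clusters, cents)]
    k = len(clusters)
    return sum(max((spread[i] + spread[j]) / math.dist(cents[i], cents[j])
                   for j in range(k) if j != i) for i in range(k)) / k

def vote_nc(X, k_range=range(2, 6)):
    # Each index nominates its preferred k; the majority wins (ties -> smallest k).
    parts = {k: kmeans(X, k) for k in k_range}
    votes = {
        "Sil": max(parts, key=lambda k: silhouette(parts[k])),
        "CH": max(parts, key=lambda k: calinski_harabasz(parts[k])),
        "DB": min(parts, key=lambda k: davies_bouldin(parts[k])),
    }
    counts = Counter(votes.values())
    best = max(counts.values())
    return min(k for k, c in counts.items() if c == best), votes

# Toy data (hypothetical): three tight, well-separated groups of five points each.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
X = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (5, 8)] for dx, dy in offsets]
nc, votes = vote_nc(X)
print(nc, votes)
```

On this toy data all three indices nominate the same value, so the vote is unanimous. On real datasets the indices often disagree, which is the situation an ensemble such as ECVI is meant to handle; a weighted or optimised combination would replace the simple vote used here.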

Acknowledgments

This work is supported by the Deanship of Scientific Research and Graduate Studies at the University of Petra, Amman, Jordan.

Author information

Authors and Affiliations

Department of Business Intelligence and Data Analytics, University of Petra, Amman, Jordan
Bilal Sowan & Nasim Matar
Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan
Tzung-Pei Hong
Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan
Tzung-Pei Hong
Department of Computer Science, Zarqa University, Zarqa, Jordan
Ahmad Al-Qerem
Department of Information Security, University of Petra, Amman, Jordan
Mohammad Alauthman

Corresponding author

Correspondence to Bilal Sowan.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sowan, B., Hong, TP., Al-Qerem, A. et al. Ensembling validation indices to estimate the optimal number of clusters. Appl Intell 53, 9933–9957 (2023). https://doi.org/10.1007/s10489-022-03939-w

Accepted: 24 June 2022
Published: 13 August 2022
Issue Date: May 2023
DOI: https://doi.org/10.1007/s10489-022-03939-w
