Abstract
Clustering has been widely used to partition data into groups so that the degree of association is high among members of the same group and low among members of different groups. Although many effective and efficient clustering algorithms have been developed and deployed, most of them still lack an automatic or online method for determining the optimal number of clusters. In this paper, we define clustering gain, a measure of clustering optimality based on the sum of squared errors accumulated as a clustering algorithm proceeds. When the measure is applied to a hierarchical clustering algorithm, an optimal number of clusters can be found. Experimental results show that the measure produces intuitively reasonable clustering configurations in Euclidean space. Furthermore, the measure can also be used to estimate the desired number of clusters for partitional clustering methods. The clustering gain measure therefore provides a promising technique for achieving higher-quality results across a wide range of clustering methods.
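The abstract's core idea, tracking a squared-error-based optimality measure as an agglomerative algorithm merges clusters, and reading off the best number of clusters, can be sketched as follows. The paper's exact clustering-gain formula is not reproduced here; as a hypothetical stand-in, this sketch records the within-cluster sum of squared errors (SSE) at every merge level of a centroid-linkage agglomeration and selects the number of clusters just before the largest SSE jump (an elbow heuristic in the same spirit). All function names are illustrative, not the authors' own.

```python
def centroid(pts):
    """Mean point of a cluster of 2-D points."""
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def sse(clusters):
    """Within-cluster sum of squared errors about each centroid."""
    total = 0.0
    for pts in clusters:
        cx, cy = centroid(pts)
        total += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in pts)
    return total

def agglomerate(points):
    """Centroid-linkage agglomerative clustering.

    Returns a dict mapping each cluster count k to the SSE of the
    k-cluster configuration produced along the way.
    """
    clusters = [[p] for p in points]
    sse_by_k = {len(clusters): sse(clusters)}
    while len(clusters) > 1:
        # Find the pair of clusters with the closest centroids.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, cj = centroid(clusters[i]), centroid(clusters[j])
                d = (ci[0] - cj[0]) ** 2 + (ci[1] - cj[1]) ** 2
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        sse_by_k[len(clusters)] = sse(clusters)
    return sse_by_k

def best_k(sse_by_k):
    """Pick k just before the largest SSE increase caused by a merge
    (a stand-in for maximizing the paper's clustering-gain measure)."""
    jumps = {k: sse_by_k[k - 1] - sse_by_k[k]
             for k in sse_by_k if k - 1 in sse_by_k}
    return max(jumps, key=jumps.get)
```

On two well-separated groups of points, the merge that collapses them into a single cluster causes by far the largest SSE increase, so `best_k` recovers 2 clusters, mirroring the abstract's claim that an optimality measure tracked over a hierarchical run can expose the natural number of clusters.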
Cite this article
Jung, Y., Park, H., Du, DZ. et al. A Decision Criterion for the Optimal Number of Clusters in Hierarchical Clustering. Journal of Global Optimization 25, 91–111 (2003). https://doi.org/10.1023/A:1021394316112