A cautionary note on using internal cross validation to select the number of clusters

Krieger, Abba M.; Green, Paul E.

doi:10.1007/BF02294300

Published: September 1999

Psychometrika, Volume 64, pages 341–353 (1999)

Abstract

A highly popular method for examining the stability of a data clustering is to split the data into two parts, cluster the observations in Part A, assign the objects in Part B to their nearest centroid in Part A, and then independently cluster the Part B objects. One then examines how close the two partitions are (say, by the Rand measure). Another proposal is to split the data into k parts, and see how their centroids cluster. By means of synthetic data analyses, we demonstrate that these approaches fail to identify the appropriate number of clusters, particularly as sample size becomes large and the variables exhibit higher correlations.
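The split-sample procedure described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: it uses a plain Lloyd's k-means, the unadjusted Rand index, a random 50/50 split, and toy Gaussian data; the function names (`kmeans`, `nearest_centroid`, `rand_index`, `split_half_agreement`) are hypothetical and do not come from the paper.

```python
import numpy as np


def nearest_centroid(X, centroids):
    """Index of the closest centroid (squared Euclidean distance) for each row of X."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for whatever clustering method is used)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct observations; fancy indexing returns a copy.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_centroid(X, centroids)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, nearest_centroid(X, centroids)


def rand_index(a, b):
    """Unadjusted Rand index: share of object pairs on which two partitions agree."""
    a, b = np.asarray(a), np.asarray(b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(len(a), k=1)  # each unordered pair once
    return float((same_a[upper] == same_b[upper]).mean())


def split_half_agreement(X, k, seed=0):
    """Cluster Part A, assign Part B to A's centroids, cluster B on its own, compare."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    half = len(X) // 2
    part_a, part_b = X[perm[:half]], X[perm[half:]]
    centroids_a, _ = kmeans(part_a, k, seed=seed)
    assigned = nearest_centroid(part_b, centroids_a)  # Part B via Part A centroids
    _, own = kmeans(part_b, k, seed=seed + 1)         # Part B clustered independently
    return rand_index(assigned, own)


# Toy data (an assumption for illustration): three Gaussian blobs in two dimensions.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
data = np.vstack([c + rng.normal(size=(60, 2)) for c in centers])
for k in range(2, 6):
    print(k, round(split_half_agreement(data, k), 3))
```

Under this procedure one would pick the number of clusters with the highest agreement. The abstract's warning is that this choice can be misleading, particularly with large samples and highly correlated variables, so the printed values should not be read as a reliable guide to the true number of clusters.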

Author information

Authors and Affiliations

Department of Statistics, University of Pennsylvania, USA
Abba M. Krieger
Marketing Department, The Wharton School, University of Pennsylvania, 1400 Steinberg Hall-Dietrich Hall, 19104-6371, Philadelphia, PA
Paul E. Green

Additional information

The authors express their thanks to the Sol C. Snider Entrepreneurial Center, Wharton School, for support of this project.

About this article

Cite this article

Krieger, A.M., Green, P.E. A cautionary note on using internal cross validation to select the number of clusters. Psychometrika 64, 341–353 (1999). https://doi.org/10.1007/BF02294300

Received: 02 April 1997
Revised: 29 September 1998
Issue Date: September 1999
DOI: https://doi.org/10.1007/BF02294300
