Abstract
A popular method for examining the stability of a data clustering is to split the data into two parts, cluster the observations in Part A, assign the objects in Part B to their nearest centroid in Part A, and then independently cluster the Part B objects. One then examines how closely the two resulting partitions of Part B agree (say, by the Rand measure). Another proposal is to split the data into k parts and see how their centroids cluster. By means of synthetic data analyses, we demonstrate that these approaches fail to identify the appropriate number of clusters, particularly as sample size becomes large and the variables exhibit higher correlations.
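The split-half procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name a clustering algorithm, so a plain k-means (Lloyd's algorithm) is assumed here, agreement is measured with the unadjusted Rand index, and all function names (`kmeans`, `rand_index`, `split_half_agreement`) are hypothetical.

```python
import random

def sq_dist(p, q):
    # Squared Euclidean distance between two points given as tuples.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    # Index of the centroid closest to the point.
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))

def kmeans(points, k, rng, iters=50):
    # Assumed clustering step: plain Lloyd's k-means with random initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

def rand_index(labels_a, labels_b):
    # Fraction of object pairs on which the two partitions agree
    # (both place the pair together, or both place it apart).
    n = len(labels_a)
    agree = 0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_a = labels_a[i] == labels_a[j]
            same_b = labels_b[i] == labels_b[j]
            agree += same_a == same_b
            pairs += 1
    return agree / pairs

def split_half_agreement(points, k, seed=0):
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    part_a, part_b = shuffled[:half], shuffled[half:]
    centroids_a, _ = kmeans(part_a, k, rng)                # cluster Part A
    assigned = [nearest(p, centroids_a) for p in part_b]   # Part B -> nearest Part A centroid
    _, clustered = kmeans(part_b, k, rng)                  # cluster Part B independently
    return rand_index(assigned, clustered)                 # compare the two partitions of Part B
```

In practice one would compute `split_half_agreement(points, k)` over a range of candidate values of k and over several random splits. The abstract's caution is that the k with the highest agreement need not be the appropriate number of clusters.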
Additional information
The authors express their thanks to the Sol C. Snider Entrepreneurial Center, Wharton School, for support of this project.
About this article
Cite this article
Krieger, A.M., Green, P.E. A cautionary note on using internal cross validation to select the number of clusters. Psychometrika 64, 341–353 (1999). https://doi.org/10.1007/BF02294300
DOI: https://doi.org/10.1007/BF02294300