
A cautionary note on using internal cross validation to select the number of clusters

Psychometrika

Abstract

A widely used method for examining the stability of a data clustering is to split the data into two parts, cluster the observations in Part A, assign the objects in Part B to their nearest Part A centroid, and then independently cluster the Part B objects. One then examines how closely the two resulting partitions of Part B agree (say, by the Rand measure). Another proposal is to split the data into k parts and see how their centroids cluster. By means of synthetic data analyses, we demonstrate that these approaches fail to identify the appropriate number of clusters, particularly as sample size becomes large and the variables exhibit higher correlations.
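The split-sample procedure described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name a clustering algorithm, so a basic k-means (Lloyd's algorithm) is assumed here, agreement is measured with the unadjusted Rand index, and all function names (`kmeans`, `rand_index`, `split_half_agreement`) are hypothetical.

```python
import random
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, rng, iters=50):
    """Basic Lloyd's k-means (an assumed stand-in for the clustering step)."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        new = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new.append(centroids[j])  # keep the centroid of an empty cluster
        if new == centroids:
            break
        centroids = new
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


def rand_index(a, b):
    """Unadjusted Rand index: share of object pairs on which two partitions agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def split_half_agreement(points, k, seed=0):
    """Split the data, cluster Part A, then compare two partitions of Part B."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    part_a, part_b = shuffled[:half], shuffled[half:]
    centroids_a, _ = kmeans(part_a, k, rng)
    # Partition 1 of Part B: nearest Part A centroid.
    assigned = [nearest(p, centroids_a) for p in part_b]
    # Partition 2 of Part B: independent clustering of Part B.
    _, clustered = kmeans(part_b, k, rng)
    return rand_index(assigned, clustered)
```

In use, one would call `split_half_agreement(points, k)` for each candidate k, with `points` a list of equal-length numeric tuples, and compare the resulting agreement values. The abstract's caution applies directly to that comparison: high agreement at a given k need not indicate the appropriate number of clusters.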



Additional information

The authors express their thanks to the Sol C. Snider Entrepreneurial Center, Wharton School, for support of this project.


About this article

Cite this article

Krieger, A.M., Green, P.E. A cautionary note on using internal cross validation to select the number of clusters. Psychometrika 64, 341–353 (1999). https://doi.org/10.1007/BF02294300


  • DOI: https://doi.org/10.1007/BF02294300
