KmL: k-means for longitudinal data

  • Original Paper
Computational Statistics

Abstract

Cohort studies are becoming essential tools in epidemiological research. In these studies, measurements are not restricted to single variables but can be seen as trajectories. Statistical methods used to identify homogeneous groups of patient trajectories fall into two families: model-based methods (such as Proc Traj) and partitional clustering (non-parametric algorithms such as k-means). KmL is a new implementation of k-means designed specifically for longitudinal data. It can deal with missing values and runs the algorithm several times, varying the starting conditions and/or the number of clusters sought; its graphical interface helps the user choose the appropriate number of clusters when the classic criterion is not effective. To assess KmL's efficiency, we compare its performance with that of Proc Traj on both artificial and real data. The two techniques give very similar clusterings when trajectories follow polynomial curves, whereas KmL gives much better results on non-polynomial trajectories.
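
The general idea of k-means on trajectories with repeated runs can be illustrated with a short Python sketch. This is not KmL's code (KmL is distributed as an R package), and every name in it (`nan_distances`, `kmeans_trajectories`, `calinski_harabasz`, `cluster_trajectories`, `demo_data`) is hypothetical. Three choices are assumptions made only for illustration: missing values are ignored when computing distances, the quality criterion is the Calinski and Harabasz index (one common choice, which may differ from the criterion KmL uses), and missing cells are replaced by the assigned cluster centre before the criterion is computed. Each trajectory is assumed to have at least one observed value.

```python
import numpy as np


def nan_distances(data, centers):
    """Distance from each trajectory to each centre, using only the time
    points observed in the trajectory (an assumption of this sketch)."""
    diff = data[:, None, :] - centers[None, :, :]      # shape (n, k, t)
    return np.sqrt(np.nanmean(diff ** 2, axis=2))      # shape (n, k)


def kmeans_trajectories(data, k, rng, max_iter=100):
    """One k-means run started from k randomly chosen trajectories."""
    n = data.shape[0]
    col_means = np.nanmean(data, axis=0)
    filled = np.where(np.isnan(data), col_means, data)
    centers = filled[rng.choice(n, size=k, replace=False)]
    labels = np.full(n, -1)
    for _ in range(max_iter):
        new_labels = np.argmin(nan_distances(data, centers), axis=1)
        if np.array_equal(new_labels, labels):
            break                                      # partition is stable
        labels = new_labels
        for j in range(k):
            members = data[labels == j]
            counts = (~np.isnan(members)).sum(axis=0)
            sums = np.nansum(members, axis=0)
            has_data = counts > 0                      # keep old value otherwise
            centers[j, has_data] = sums[has_data] / counts[has_data]
    return labels, centers


def calinski_harabasz(data, labels, centers, k):
    """Calinski-Harabasz index; missing cells are replaced by the assigned
    centre first (a crude choice made for this sketch)."""
    completed = np.where(np.isnan(data), centers[labels], data)
    n = completed.shape[0]
    overall = completed.mean(axis=0)
    between = within = 0.0
    for j in np.unique(labels):
        members = completed[labels == j]
        center = members.mean(axis=0)
        between += len(members) * np.sum((center - overall) ** 2)
        within += np.sum((members - center) ** 2)
    return (between / (k - 1)) / (within / (n - k))


def cluster_trajectories(data, k_values=(2, 3, 4, 5), n_restarts=40, seed=0):
    """For each k, rerun k-means from different starts and keep the best run."""
    rng = np.random.default_rng(seed)
    results = {}
    for k in k_values:
        best = None
        for _ in range(n_restarts):
            labels, centers = kmeans_trajectories(data, k, rng)
            score = calinski_harabasz(data, labels, centers, k)
            if best is None or score > best[0]:
                best = (score, labels)
        results[k] = best
    return results


def demo_data(seed=1):
    """Three synthetic groups of 30 trajectories: flat, linear, and a step
    (non-polynomial) shape, with about 5% of the values missing."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 1, 10)
    shapes = [np.zeros_like(t), 10 + 20 * t, np.where(t < 0.5, 40.0, 60.0)]
    data = np.vstack([s + rng.normal(0, 1, size=(30, t.size)) for s in shapes])
    data[rng.random(data.shape) < 0.05] = np.nan
    return data


if __name__ == "__main__":
    results = cluster_trajectories(demo_data())
    for k, (score, _) in results.items():
        print(f"k={k}: criterion={score:.1f}")
    print("suggested k:", max(results, key=lambda k: results[k][0]))
```

Restarting from several random seeds matters because a single k-means run can stop in a poor local optimum, and comparing the best criterion value across values of k gives a first indication of how many clusters to retain.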

Author information

Corresponding author

Correspondence to Christophe Genolini.

About this article

Cite this article

Genolini, C., Falissard, B. KmL: k-means for longitudinal data. Comput Stat 25, 317–328 (2010). https://doi.org/10.1007/s00180-009-0178-4

  • DOI: https://doi.org/10.1007/s00180-009-0178-4
