
DHCC: Divisive hierarchical clustering of categorical data

Data Mining and Knowledge Discovery

Abstract

Clustering categorical data poses two challenges: defining an inherently meaningful similarity measure, and effectively dealing with clusters that are often embedded in different subspaces. In this paper, we propose a novel divisive hierarchical clustering algorithm for categorical data, named DHCC. We view the task of clustering categorical data from an optimization perspective and propose effective procedures to initialize and refine the splitting of clusters. The initialization of the splitting is based on multiple correspondence analysis (MCA). We also devise a strategy for deciding when to terminate the splitting process. The proposed algorithm has five merits. First, due to its hierarchical nature, it yields a dendrogram representing nested groupings of patterns and similarity levels at different granularities. Second, it is parameter-free and fully automatic and, in particular, requires no assumption regarding the number of clusters. Third, it is independent of the order in which the data is processed. Fourth, it is scalable to large data sets. Finally, it is capable of seamlessly discovering clusters embedded in subspaces, thanks to its use of a novel data representation and chi-square dissimilarity measures. Experiments on both synthetic and real data demonstrate the superior performance of our algorithm.
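
The abstract describes the scheme only at a high level; the MCA-based split initialization, the refinement procedure, and the termination criterion are detailed in the paper itself. As a rough illustration of the general divisive idea, and not the paper's actual method, the Python sketch below one-hot encodes categorical records, initializes each binary split from the first correspondence-analysis axis of the indicator matrix (a standard MCA-style standardization that may differ from the one used in DHCC), and recurses. The functions `one_hot`, `mca_split`, and `dhcc_like`, as well as the `min_size`/`max_depth` stopping rules, are hypothetical placeholders rather than the criteria proposed in the paper.

```python
import numpy as np

def one_hot(X):
    """Indicator (one-hot) encoding of a categorical data matrix X (n objects x m attributes)."""
    cols = []
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        cols.append((X[:, j][:, None] == values[None, :]).astype(float))
    return np.hstack(cols)

def mca_split(Z):
    """Initialize a binary split from the first MCA/CA axis of the indicator matrix Z.

    The standardization below (removing row/column masses) is a common CA
    approximation; the exact initialization in DHCC may differ.
    """
    P = Z / Z.sum()
    r = P.sum(axis=1, keepdims=True)          # row masses
    c = P.sum(axis=0, keepdims=True)          # column masses
    S = (P - r @ c) / np.sqrt(r @ c)          # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    scores = U[:, 0] * s[0]                   # first principal coordinates of the rows
    return scores >= 0                        # boolean mask: which objects go to one child

def dhcc_like(X, min_size=5, depth=0, max_depth=10):
    """Recursive divisive clustering sketch: split a cluster in two, then recurse.

    min_size and max_depth are placeholder stopping rules, not the
    termination criterion devised in the paper.
    """
    n = X.shape[0]
    if n < 2 * min_size or depth >= max_depth:
        return [np.arange(n)]
    mask = mca_split(one_hot(X))
    if mask.all() or (~mask).all():           # split failed to separate anything
        return [np.arange(n)]
    clusters = []
    for side in (mask, ~mask):
        idx = np.flatnonzero(side)
        for sub in dhcc_like(X[idx], min_size, depth + 1, max_depth):
            clusters.append(idx[sub])
    return clusters

# Toy example: two groups of categorical records
X = np.array([["a", "x"], ["a", "x"], ["a", "y"],
              ["b", "z"], ["b", "z"], ["b", "w"]] * 3, dtype=object)
print(dhcc_like(X, min_size=2))
```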




Author information

Corresponding author

Correspondence to Tengke Xiong.

Additional information

Responsible editor: Charu Aggarwal.


About this article

Cite this article

Xiong, T., Wang, S., Mayers, A. et al. DHCC: Divisive hierarchical clustering of categorical data. Data Min Knowl Disc 24, 103–135 (2012). https://doi.org/10.1007/s10618-011-0221-2

