
DHCC: Divisive hierarchical clustering of categorical data

Data Mining and Knowledge Discovery

Abstract

Clustering categorical data poses two challenges: defining an inherently meaningful similarity measure, and effectively dealing with clusters that are often embedded in different subspaces. In this paper, we propose a novel divisive hierarchical clustering algorithm for categorical data, named DHCC. We view the task of clustering categorical data from an optimization perspective and propose effective procedures to initialize and refine the splitting of clusters. The initialization of the splitting is based on multiple correspondence analysis (MCA). We also devise a strategy for deciding when to terminate the splitting process. The proposed algorithm has five merits. First, due to its hierarchical nature, it yields a dendrogram representing nested groupings of patterns and similarity levels at different granularities. Second, it is parameter-free and fully automatic and, in particular, requires no assumption regarding the number of clusters. Third, it is independent of the order in which the data is processed. Fourth, it is scalable to large data sets. Finally, it is capable of seamlessly discovering clusters embedded in subspaces, thanks to its use of a novel data representation and chi-square dissimilarity measures. Experiments on both synthetic and real data demonstrate the superior performance of our algorithm.
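
The abstract describes the scheme only at a high level; the MCA-based split initialization, the refinement procedure, and the termination criterion are detailed in the paper itself. As a rough illustration of the general divisive idea, and not the paper's actual method, the Python sketch below one-hot encodes categorical records, initializes each binary split from the first correspondence-analysis axis of the indicator matrix (a standard MCA-style standardization that may differ from the one used in DHCC), and recurses. The functions `one_hot`, `mca_split`, and `dhcc_like`, as well as the `min_size`/`max_depth` stopping rules, are hypothetical placeholders rather than the criteria proposed in the paper.

```python
import numpy as np

def one_hot(X):
    """Indicator (one-hot) encoding of a categorical data matrix X (n objects x m attributes)."""
    cols = []
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        cols.append((X[:, j][:, None] == values[None, :]).astype(float))
    return np.hstack(cols)

def mca_split(Z):
    """Initialize a binary split from the first MCA/CA axis of the indicator matrix Z.

    The standardization below (removing row/column masses) is a common CA
    approximation; the exact initialization in DHCC may differ.
    """
    P = Z / Z.sum()
    r = P.sum(axis=1, keepdims=True)          # row masses
    c = P.sum(axis=0, keepdims=True)          # column masses
    S = (P - r @ c) / np.sqrt(r @ c)          # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    scores = U[:, 0] * s[0]                   # first principal coordinates of the rows
    return scores >= 0                        # boolean mask: which objects go to one child

def dhcc_like(X, min_size=5, depth=0, max_depth=10):
    """Recursive divisive clustering sketch: split a cluster in two, then recurse.

    min_size and max_depth are placeholder stopping rules, not the
    termination criterion devised in the paper.
    """
    n = X.shape[0]
    if n < 2 * min_size or depth >= max_depth:
        return [np.arange(n)]
    mask = mca_split(one_hot(X))
    if mask.all() or (~mask).all():           # split failed to separate anything
        return [np.arange(n)]
    clusters = []
    for side in (mask, ~mask):
        idx = np.flatnonzero(side)
        for sub in dhcc_like(X[idx], min_size, depth + 1, max_depth):
            clusters.append(idx[sub])
    return clusters

# Toy example: two groups of categorical records
X = np.array([["a", "x"], ["a", "x"], ["a", "y"],
              ["b", "z"], ["b", "z"], ["b", "w"]] * 3, dtype=object)
print(dhcc_like(X, min_size=2))
```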




Author information

Corresponding author

Correspondence to Tengke Xiong.

Additional information

Responsible editor: Charu Aggarwal.


About this article

Cite this article

Xiong, T., Wang, S., Mayers, A. et al. DHCC: Divisive hierarchical clustering of categorical data. Data Min Knowl Disc 24, 103–135 (2012). https://doi.org/10.1007/s10618-011-0221-2

