Clustering Mixed Type Attributes in Large Dataset

Yin, Jian; Tan, Zhifang

doi:10.1007/11576235_66

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3758)

Abstract

Clustering is a widely used technique in data mining. Many clustering algorithms exist, but most either handle only a single attribute type or handle both data types but are inefficient on large data sets. Few algorithms do both well. In this paper, we propose a clustering algorithm, CFIKP, that can handle large datasets with mixed-type attributes. We first use a CF*-tree to pre-cluster the dataset. Once the dense regions are stored in the leaf nodes, we treat every dense region as a single point and use an improved k-prototype algorithm to cluster these dense regions. Experiments show that the CFIKP algorithm is very efficient in clustering large datasets with mixed-type attributes.
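The two-stage idea in the abstract (summarize dense regions, then cluster the summaries with a k-prototype-style method) can be illustrated with a minimal sketch. This is not the authors' CFIKP implementation: the CF*-tree and the "improved" k-prototype are not specified in the abstract, so the sketch assumes the standard k-prototypes dissimilarity (squared Euclidean distance on numeric attributes plus a weight `gamma` times the number of categorical mismatches) and a simple mean/mode summary of each region. The function names `mixed_dissimilarity`, `summarize_region`, and `assign` are hypothetical.

```python
from collections import Counter


def mixed_dissimilarity(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """Standard k-prototypes dissimilarity (assumed here, not taken from the paper):
    squared Euclidean distance on numeric attributes plus gamma times the
    number of mismatching categorical attributes."""
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    mismatches = sum(1 for x, y in zip(cat_a, cat_b) if x != y)
    return numeric + gamma * mismatches


def summarize_region(records):
    """Collapse one dense region into a single weighted point.
    Each record is (numeric_tuple, categorical_tuple). The summary is the
    numeric mean, the per-attribute categorical mode, and the record count.
    This stands in for a leaf-node summary; the real CF*-tree entry may differ."""
    n = len(records)
    n_num = len(records[0][0])
    n_cat = len(records[0][1])
    mean = tuple(sum(r[0][d] for r in records) / n for d in range(n_num))
    mode = tuple(
        Counter(r[1][j] for r in records).most_common(1)[0][0]
        for j in range(n_cat)
    )
    return mean, mode, n


def assign(summary, prototypes, gamma=1.0):
    """Return the index of the closest prototype for a region summary.
    Each prototype is (numeric_tuple, categorical_tuple)."""
    mean, mode, _count = summary
    costs = [
        mixed_dissimilarity(mean, mode, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return costs.index(min(costs))


# Hypothetical toy data: one dense region and two candidate prototypes.
region = [((1.0, 2.0), ("red",)), ((3.0, 4.0), ("red",)), ((2.0, 3.0), ("blue",))]
prototypes = [((10.0, 10.0), ("red",)), ((2.0, 3.5), ("blue",))]
summary = summarize_region(region)
nearest = assign(summary, prototypes)
```

Because each region is reduced to one point with a count, the second stage operates on far fewer items than the raw dataset, which is the efficiency argument the abstract makes.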

This work is supported by the National Natural Science Foundation of China (60205007), Natural Science Foundation of Guangdong Province (031558, 04300462), Research Foundation of National Science and Technology Plan Project (2004BA721A02), Research Foundation of Science and Technology Plan Project in Guangdong Province (2003C50118) and Research Foundation of Science and Technology Plan Project in Guangzhou City (2002Z3-E0017).

Author information

Authors and Affiliations

Department of Computer Science, Zhongshan University, Guangzhou, 510275, China
Jian Yin & Zhifang Tan

Editor information

Editors and Affiliations

Dept. of CS, Georgia State University, 30302, Atlanta, GA, USA
Yi Pan
State Key Laboratory for Novel Software Technology, Nanjing University, 210093, Nanjing, Jiangsu, China
Daoxu Chen
Department of Computer Science and Engineering, Shanghai Jiao Tong University, 200030, Shanghai, China
Minyi Guo
Department of Computing, Hong Kong Polytechnic University, Kowloon, Hong Kong, China
Jiannong Cao
Computer Science Department, University of Tennessee, 37996-3450, Knoxville, TN, USA
Jack Dongarra

About this paper

Cite this paper

Yin, J., Tan, Z. (2005). Clustering Mixed Type Attributes in Large Dataset. In: Pan, Y., Chen, D., Guo, M., Cao, J., Dongarra, J. (eds) Parallel and Distributed Processing and Applications. ISPA 2005. Lecture Notes in Computer Science, vol 3758. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11576235_66

DOI: https://doi.org/10.1007/11576235_66
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29769-7
Online ISBN: 978-3-540-32100-2
eBook Packages: Computer Science, Computer Science (R0)
