Feature Selection for Clustering

Living reference work entry, Encyclopedia of Database Systems

Definition

The problem of feature selection arises because, when collecting data, one tends to gather every attribute that might possibly be useful, yet for a specific learning task such as clustering not all of these attributes or features are important. Feature selection is well studied in supervised learning, i.e., for the classification task, because class labels are available and it is comparatively easy to select the features that lead to those classes. For unsupervised data without class labels, as in the clustering task, it is much less obvious which features should be selected: some features may be redundant, some irrelevant, and others only “weakly relevant”. The task of feature selection for clustering is therefore to select the “best” subset of relevant features, according to a chosen criterion, that helps uncover the natural clusters in the data.
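As a concrete illustration of this task (a minimal sketch under assumptions of our own, not an algorithm prescribed by this entry), a wrapper-style approach can score each candidate feature subset by how well a clustering algorithm separates the data restricted to that subset. The sketch below uses k-means as the clustering algorithm and the silhouette coefficient as the criterion; the helper name select_features and the exhaustive search over small subsets are illustrative choices.

```python
# A minimal wrapper-style sketch: score each small feature subset by clustering
# the data restricted to that subset (k-means) and measuring the silhouette
# coefficient, then keep the best-scoring subset. Illustrative only; the
# clustering algorithm and criterion are assumptions, not this entry's method.
from itertools import combinations

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def select_features(X, n_clusters, max_subset_size=2):
    """Exhaustively evaluate feature subsets up to max_subset_size (hypothetical helper)."""
    best_subset, best_score = None, -1.0
    for size in range(1, max_subset_size + 1):
        for subset in combinations(range(X.shape[1]), size):
            Xs = X[:, list(subset)]              # data restricted to this candidate subset
            labels = KMeans(n_clusters=n_clusters, n_init=10,
                            random_state=0).fit_predict(Xs)
            score = silhouette_score(Xs, labels)  # higher = better-separated clusters
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```

Exhaustive search of this kind is exponential in the number of features, which is why practical methods replace it with cheaper filter criteria or heuristic search strategies; several such approaches appear in the Recommended Reading below.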

Figure 1 shows an example using synthetic data. There are three clusters in the F1-F2 dimensions, each following a Gaussian distribution, whereas F3, which does not define...
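To make the figure's setup concrete, the sketch below generates data of the kind described: three Gaussian clusters in the F1-F2 plane plus a feature F3 that carries no cluster structure. The cluster centres, spreads, and the uniform range of F3 are assumptions chosen for illustration, not values taken from the figure; the point is simply that a clustering criterion computed on F1-F2 alone scores higher than the same criterion computed on all three features, which is the sense in which F3 is irrelevant.

```python
# Synthetic data in the spirit of Fig. 1 (parameters are illustrative assumptions):
# three Gaussian clusters live in the F1-F2 plane, while F3 is pure noise.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

centres = [(0.0, 0.0), (6.0, 0.0), (3.0, 5.0)]        # three cluster centres in F1-F2
f12 = np.vstack([rng.normal(loc=c, scale=0.8, size=(100, 2)) for c in centres])
f3 = rng.uniform(-5.0, 5.0, size=(f12.shape[0], 1))   # irrelevant feature: no cluster structure
X = np.hstack([f12, f3])

for name, cols in [("F1-F2", [0, 1]), ("F1-F2-F3", [0, 1, 2])]:
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[:, cols])
    print(f"{name}: silhouette = {silhouette_score(X[:, cols], labels):.3f}")
```

With these parameters the silhouette for F1-F2 comes out clearly higher than for F1-F2-F3, so a selection procedure driven by this criterion would discard F3.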


Recommended Reading

  1. Aggarwal CC, Procopiuc C, Wolf JL, Yu PS, Park JS. Fast algorithms for projected clustering. In: Proceedings of the ACM SIGMOD international conference on management of data, 1999. p. 61–72.

  2. Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD international conference on management of data, 1998. p. 94–105.

  3. Amershi S, Conati C, Maclaren H. Using feature selection and unsupervised clustering to identify affective expressions in educational games. In: Proceedings of the workshop on motivational and affective issues in ITS, 8th International conference on ITS, 2006. p. 21–8.

  4. Bekkerman R, El-Yaniv R, Tishby N, Winter Y. Distributional word clusters vs. words for text categorization. J Mach Learn Res. 2003;3:1183–208.

  5. Dash M, Choi K, Scheuermann P, Liu H. Feature selection for clustering – a filter solution. In: Proceedings of the 2002 IEEE international conference on data mining, 2002. p. 115–22.

  6. Dash M, Liu H. Feature selection for classification. Intell Data Anal. 1997;1(3):131–56.

  7. Dash M, Liu H. Handling large unsupervised data via dimensionality reduction. In: Proceedings of the ACM SIGMOD workshop on research issues in data mining and knowledge discovery, 1999.

  8. Devaney M, Ram A. Efficient feature selection in conceptual clustering. In: Proceedings of the 14th international conference on machine learning, 1997. p. 92–7.

  9. Duda RO, Hart PE. Pattern classification and scene analysis, Chap. Unsupervised learning and clustering. New York: Wiley, 1973.

  10. Dy JG, Brodley CE. Feature subset selection and order identification for unsupervised learning. In: Proceedings of the 17th international conference on machine learning, 2000. p. 247–54.

  11. Dy JG, Brodley CE. Visualization and interactive feature selection for unsupervised data. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, 2000. p. 360–4.

  12. Dy JG, Brodley CE. Feature selection for unsupervised learning. J Mach Learn Res. 2004;5:845–89.

  13. Fisher DH. Knowledge acquisition via incremental conceptual clustering. Mach Learn. 1987;2:139–72.

  14. Friedman J, Meulman J. Clustering objects on subsets of attributes. J Royal Stat Soc B. 2004;66(4):1–25.

  15. Gilad-Bachrach R, Navot A, Tishby N. Margin based feature selection – theory and algorithms. In: Proceedings of the 21st international conference on machine learning, 2004. p. 43.

  16. Jain AK, Dubes RC. Algorithms for clustering data, Chap. Clustering methods and algorithms. Prentice-Hall advanced reference series, 1988.

  17. Kim YS, Street WN, Menczer F. Feature selection in unsupervised learning via evolutionary search. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, 2000. p. 365–9.

  18. Law MHC, Figueiredo MAT, Jain AK. Simultaneous feature selection and clustering using mixture models. IEEE Trans Pattern Anal Mach Intell. 2004;26(9):1154–66.

  19. Milligan GW. A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika. 1981;46(2):187–98.

  20. Talavera L. Feature selection as a preprocessing step for hierarchical clustering. In: Proceedings of the 16th international conference on machine learning, 1999. p. 389–97.

  21. Talavera L. Feature selection and incremental learning of probabilistic concept hierarchies. In: Proceedings of the 17th international conference on machine learning, 2000. p. 951–8.

  22. Vaithyanathan S, Dom B. Model selection in unsupervised learning with applications to document clustering. In: Proceedings of the 16th international conference on machine learning, 1999. p. 433–43.

  23. Xing EP, Karp RM. CLIFF: clustering of high-dimensional microarray data via iterative feature filtering using normalized cuts. In: Proceedings of the 9th international conference on intelligent systems for molecular biology, 2001. p. 306–15.

  24. Yousef M, Jung S, Showe LC, Showe MK. Recursive Cluster Elimination (RCE) for classification and feature selection from gene expression data. BMC Bioinformatics. 2007;8:144.


Author information

Correspondence to Manoranjan Dash.

Copyright information

© 2016 Springer Science+Business Media LLC

About this entry

Cite this entry

Dash, M., Koot, P.W. (2016). Feature Selection for Clustering. In: Liu, L., Özsu, M. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4899-7993-3_613-2

  • DOI: https://doi.org/10.1007/978-1-4899-7993-3_613-2

  • Publisher Name: Springer, New York, NY

  • Online ISBN: 978-1-4899-7993-3

  • eBook Packages: Springer Reference Computer Sciences; Reference Module Computer Science and Engineering
