Feature Selection for Clustering

Living reference work entry, Encyclopedia of Database Systems

Definition

The problem of feature selection arises because, when collecting data, one tends to gather every attribute that might possibly be useful, yet for a specific learning task such as clustering not all of these attributes or features are important. Feature selection is well studied in supervised learning, i.e., for the classification task, because class labels are available and it is comparatively easy to select the features that lead to those classes. For unsupervised data without class labels, as in the clustering task, it is much less obvious which features should be selected: some features may be redundant, some irrelevant, and others only “weakly relevant”. The task of feature selection for clustering is therefore to select the “best” subset of relevant features, according to a chosen criterion, that helps uncover the natural clusters in the data.
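As a concrete illustration of this task (a minimal sketch under assumptions of our own, not an algorithm prescribed by this entry), a wrapper-style approach can score each candidate feature subset by how well a clustering algorithm separates the data restricted to that subset. The sketch below uses k-means as the clustering algorithm and the silhouette coefficient as the criterion; the helper name select_features and the exhaustive search over small subsets are illustrative choices.

```python
# A minimal wrapper-style sketch: score each small feature subset by clustering
# the data restricted to that subset (k-means) and measuring the silhouette
# coefficient, then keep the best-scoring subset. Illustrative only; the
# clustering algorithm and criterion are assumptions, not this entry's method.
from itertools import combinations

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def select_features(X, n_clusters, max_subset_size=2):
    """Exhaustively evaluate feature subsets up to max_subset_size (hypothetical helper)."""
    best_subset, best_score = None, -1.0
    for size in range(1, max_subset_size + 1):
        for subset in combinations(range(X.shape[1]), size):
            Xs = X[:, list(subset)]              # data restricted to this candidate subset
            labels = KMeans(n_clusters=n_clusters, n_init=10,
                            random_state=0).fit_predict(Xs)
            score = silhouette_score(Xs, labels)  # higher = better-separated clusters
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```

Exhaustive search of this kind is exponential in the number of features, which is why practical methods replace it with cheaper filter criteria or heuristic search strategies; several such approaches appear in the Recommended Reading below.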

Figure 1 shows an example using synthetic data. There are three clusters in the F1-F2 dimensions, each following a Gaussian distribution, whereas F3, which does not define...
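To make the figure's setup concrete, the sketch below generates data of the kind described: three Gaussian clusters in the F1-F2 plane plus a feature F3 that carries no cluster structure. The cluster centres, spreads, and the uniform range of F3 are assumptions chosen for illustration, not values taken from the figure; the point is simply that a clustering criterion computed on F1-F2 alone scores higher than the same criterion computed on all three features, which is the sense in which F3 is irrelevant.

```python
# Synthetic data in the spirit of Fig. 1 (parameters are illustrative assumptions):
# three Gaussian clusters live in the F1-F2 plane, while F3 is pure noise.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

centres = [(0.0, 0.0), (6.0, 0.0), (3.0, 5.0)]        # three cluster centres in F1-F2
f12 = np.vstack([rng.normal(loc=c, scale=0.8, size=(100, 2)) for c in centres])
f3 = rng.uniform(-5.0, 5.0, size=(f12.shape[0], 1))   # irrelevant feature: no cluster structure
X = np.hstack([f12, f3])

for name, cols in [("F1-F2", [0, 1]), ("F1-F2-F3", [0, 1, 2])]:
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[:, cols])
    print(f"{name}: silhouette = {silhouette_score(X[:, cols], labels):.3f}")
```

With these parameters the silhouette for F1-F2 comes out clearly higher than for F1-F2-F3, so a selection procedure driven by this criterion would discard F3.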


Recommended Reading

  1. Aggarwal CC, Procopiuc C, Wolf JL, Yu PS, Park JS. Fast algorithms for projected clustering. In: Proceedings of the ACM SIGMOD international conference on management of data, 1999. p. 61–72.

  2. Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD international conference on management of data, 1998. p. 94–105.

  3. Amershi S, Conati C, Maclaren H. Using feature selection and unsupervised clustering to identify affective expressions in educational games. In: Proceedings of the workshop on motivational and affective issues in ITS, 8th International conference on ITS, 2006. p. 21–8.

  4. Bekkerman R, El-Yaniv R, Tishby N, Winter Y. Distributional word clusters vs. words for text categorization. J Mach Learn Res. 2003;3:1183–208.

  5. Dash M, Choi K, Scheuermann P, Liu H. Feature selection for clustering – a filter solution. In: Proceedings of the 2002 IEEE international conference on data mining, 2002. p. 115–22.

  6. Dash M, Liu H. Feature selection for classification. Intell Data Anal. 1997;1(3):131–56.

  7. Dash M, Liu H. Handling large unsupervised data via dimensionality reduction. In: Proceedings of the ACM SIGMOD workshop on research issues in data mining and knowledge discovery, 1999.

  8. Devaney M, Ram A. Efficient feature selection in conceptual clustering. In: Proceedings of the 14th international conference on machine learning, 1997. p. 92–7.

  9. Duda RO, Hart PE. Pattern classification and scene analysis, Chap. Unsupervised learning and clustering. New York: Wiley, 1973.

  10. Dy JG, Brodley CE. Feature subset selection and order identification for unsupervised learning. In: Proceedings of the 17th international conference on machine learning, 2000. p. 247–54.

  11. Dy JG, Brodley CE. Visualization and interactive feature selection for unsupervised data. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, 2000. p. 360–4.

  12. Dy JG, Brodley CE. Feature selection for unsupervised learning. J Mach Learn Res. 2004;5:845–89.

  13. Fisher DH. Knowledge acquisition via incremental conceptual clustering. Mach Learn. 1987;2:139–72.

  14. Friedman J, Meulman J. Clustering objects on subsets of attributes. J Royal Stat Soc B. 2004;66(4):1–25.

  15. Gilad-Bachrach R, Navot A, Tishby N. Margin based feature selection – theory and algorithms. In: Proceedings of the 21st international conference on machine learning, 2004. p. 43.

  16. Jain AK, Dubes RC. Algorithms for clustering data, Chap. Clustering methods and algorithms. Prentice-Hall advanced reference series, 1988.

  17. Kim YS, Street WN, Menczer F. Feature selection in unsupervised learning via evolutionary search. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, 2000. p. 365–9.

  18. Law MHC, Figueiredo MAT, Jain AK. Simultaneous feature selection and clustering using mixture models. IEEE Trans Pattern Anal Mach Intell. 2004;26(9):1154–66.

  19. Milligan GW. A Monte Carlo study of thirty internal criterion measures for cluster analysis. Psychometrika. 1981;46(2):187–98.

  20. Talavera L. Feature selection as a preprocessing step for hierarchical clustering. In: Proceedings of the 16th international conference on machine learning, 1999. p. 389–97.

  21. Talavera L. Feature selection and incremental learning of probabilistic concept hierarchies. In: Proceedings of the 17th international conference on machine learning, 2000. p. 951–8.

  22. Vaithyanathan S, Dom B. Model selection in unsupervised learning with applications to document clustering. In: Proceedings of the 16th international conference on machine learning, 1999. p. 433–43.

  23. Xing EP, Karp RM. CLIFF: clustering of high-dimensional microarray data via iterative feature filtering using normalized cuts. In: Proceedings of the 9th international conference on intelligent systems for molecular biology, 2001. p. 306–15.

  24. Yousef M, Jung S, Showe LC, Showe MK. Recursive Cluster Elimination (RCE) for classification and feature selection from gene expression data. BMC Bioinformatics. 2007;8:144.


Author information

Correspondence to Manoranjan Dash.

Copyright information

© 2016 Springer Science+Business Media LLC

About this entry

Cite this entry

Dash, M., Koot, P.W. (2016). Feature Selection for Clustering. In: Liu, L., Özsu, M. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4899-7993-3_613-2

  • DOI: https://doi.org/10.1007/978-1-4899-7993-3_613-2

  • Publisher Name: Springer, New York, NY

  • Online ISBN: 978-1-4899-7993-3

  • eBook Packages: Springer Reference Computer Sciences; Reference Module Computer Science and Engineering
