DOI: 10.1145/3097983.3097989
Research article, KDD Conference Proceedings

Towards an Optimal Subspace for K-Means

Published: 04 August 2017

ABSTRACT

Is there an optimal dimensionality reduction for k-means, revealing the prominent cluster structure hidden in the data? We propose SUBKMEANS, which extends the classic k-means algorithm. The goal of this algorithm is twofold: find a good k-means-style clustering partition and transform the clusters onto a common subspace that is optimal for the cluster structure. Our solution pursues these two goals simultaneously. The dimensionality of this subspace is found automatically, so the algorithm comes without the burden of additional parameters; at the same time, the subspace helps to mitigate the curse of dimensionality. The SUBKMEANS optimization algorithm is intriguingly simple and efficient. It is easy to implement and can readily be adapted to the situation at hand. Furthermore, it is compatible with many existing extensions and improvements of k-means.
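To make the twofold objective concrete, here is a minimal, illustrative sketch in Python/NumPy of one way such an alternating optimization could look. This is a hedged reading of the abstract, not the published SubKmeans procedure: the specific update rule (an eigendecomposition of the within-cluster scatter minus the total scatter, with the number of negative eigenvalues fixing the subspace dimensionality m) and the function name subkmeans_sketch are assumptions introduced here for illustration.

```python
import numpy as np

def subkmeans_sketch(X, k, n_iter=50, seed=0):
    """Hypothetical sketch of a SubKmeans-style alternating optimization;
    not the published algorithm. X: (n, d) data matrix, k: cluster count."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    V = np.linalg.qr(rng.normal(size=(d, d)))[0]   # orthogonal d x d transform
    m = max(1, d // 2)                             # initial subspace dimensionality
    centers = X[rng.choice(n, size=k, replace=False)]

    for _ in range(n_iter):
        # Step 1: assign each point to the nearest center, with distances
        # measured in the current m-dimensional clustered subspace V[:, :m].
        P = V[:, :m]
        Xp, Cp = X @ P, centers @ P
        d2 = ((Xp[:, None, :] - Cp[None, :, :]) ** 2).sum(axis=-1)  # (n, k)
        labels = d2.argmin(axis=1)

        # Step 2: recompute centers as cluster means in the full space;
        # re-seed any cluster that lost all of its points.
        centers = np.vstack([
            X[labels == j].mean(axis=0) if np.any(labels == j) else X[rng.integers(n)]
            for j in range(k)
        ])

        # Step 3: update the transformation from the eigendecomposition of
        # S_W - S_D (within-cluster scatter minus total scatter). Directions
        # with negative eigenvalues concentrate the cluster structure, so
        # their count gives the subspace dimensionality m automatically.
        Xc = X - X.mean(axis=0)
        S_D = Xc.T @ Xc
        S_W = sum((X[labels == j] - centers[j]).T @ (X[labels == j] - centers[j])
                  for j in range(k))
        evals, evecs = np.linalg.eigh(S_W - S_D)   # eigenvalues in ascending order
        V = evecs
        m = max(1, int((evals < 0).sum()))

    return labels, V[:, :m], m
```

As in classic k-means, the sketch alternates an assignment step and a center-update step; the additional eigendecomposition step rotates the feature space so that the leading m axes carry the cluster structure, while the remaining axes can be treated as cluster-irrelevant. A call such as labels, P, m = subkmeans_sketch(X, k=3) would return a partition, a d-by-m projection matrix, and the discovered subspace dimensionality.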

Supplemental Material

mautz_optimal_subspace.mp4 (MP4 video, 379.7 MB)



Published in

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2017, 2240 pages
ISBN: 9781450348874
DOI: 10.1145/3097983
Copyright © 2017 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

KDD '17 paper acceptance rate: 64 of 748 submissions, 9%
Overall acceptance rate: 1,133 of 8,635 submissions, 13%
