ABSTRACT
Is there an optimal dimensionality reduction for k-means, one that reveals the prominent cluster structure hidden in the data? We propose SUBKMEANS, which extends the classic k-means algorithm. The goal of this algorithm is twofold: find a good k-means-style clustering partition and transform the clusters into a common subspace that is optimal for the cluster structure. Our solution pursues these two goals simultaneously. The dimensionality of this subspace is found automatically, so the algorithm comes without the burden of additional parameters; at the same time, the subspace helps to mitigate the curse of dimensionality. The SUBKMEANS optimization algorithm is intriguingly simple and efficient. It is easy to implement, can readily be adapted to the situation at hand, and is compatible with many existing extensions and improvements of k-means.
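To make the alternating scheme concrete, below is a minimal NumPy sketch of a SUBKMEANS-style iteration: points are assigned with ordinary k-means distances in a rotated cluster space, while the orthogonal basis V and the subspace dimensionality m are recovered from an eigendecomposition of the within-cluster scatter minus the full-data scatter. This is our illustrative reading of the abstract's alternating optimization, not the authors' reference implementation; the names (subkmeans, scatter, V, m) are ours.

```python
import numpy as np

def scatter(X):
    """Scatter matrix of X around its mean."""
    D = X - X.mean(axis=0)
    return D.T @ D

def subkmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    V = np.linalg.qr(rng.standard_normal((d, d)))[0]   # random orthogonal basis
    m = d // 2                                         # initial cluster-space dimensionality
    centers = X[rng.choice(n, size=k, replace=False)]  # simple random init
    S_D = scatter(X)                                   # scatter of the full dataset
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Assignment step: ordinary k-means, but in the m-dimensional cluster space.
        P = V[:, :m]
        d2 = (((X @ P)[:, None, :] - (centers @ P)[None, :, :]) ** 2).sum(axis=2)
        new = d2.argmin(axis=1)
        if np.array_equal(new, labels):                # converged: assignments stable
            break
        labels = new
        # Update step: centers in full space, then rotation and dimensionality.
        centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                            else X[rng.integers(n)] for j in range(k)])
        S_W = sum(scatter(X[labels == j]) for j in range(k) if (labels == j).any())
        evals, V = np.linalg.eigh(S_W - S_D)           # eigenvalues in ascending order
        m = max(1, int((evals < 0).sum()))             # negative eigenvalues span the cluster space
    return labels, V[:, :m], m
```

Note how the dimensionality m falls out of the eigendecomposition (the count of negative eigenvalues) rather than being supplied as a parameter, matching the abstract's claim that the subspace dimensionality is found automatically; the returned projection V[:, :m] can then be used, for instance, to visualize the clustered data in the learned subspace.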