Abstract
Subspace clustering has been studied intensively over the last two decades. Its objective is to find all lower-dimensional clusters hidden in subspaces of high-dimensional data. Although most existing subspace clustering algorithms adopt heuristic pruning techniques to reduce the search space, their time complexity remains exponential in the highest dimensionality of the hidden subspace clusters. Even with the help of parallelism, these techniques require extremely high computation time in practice. In this paper, we propose a novel subspace clustering technique that reduces the exponential time complexity to quadratic via approximation. We also provide a parallel implementation of the proposed algorithm on top of Apache Spark to further accelerate our approach on large data sets. Preliminary experimental results show that our algorithm performs much better, especially in terms of scalability with regard to the dimensionality of the hidden clusters.
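To make the gap between exponential and quadratic growth concrete, the sketch below counts, for a hidden cluster of dimensionality k, the non-empty subsets of its k dimensions (2^k - 1) against the number of dimension pairs (k(k - 1)/2). It is an arithmetic illustration only, not the algorithm proposed in the paper. The function names are hypothetical, and reading the quadratic cost as a count of pairs is an assumption made for illustration.

```python
from math import comb


def bottom_up_candidates(k: int) -> int:
    """Count the non-empty subsets of k dimensions, i.e. 2**k - 1.

    Assumption: a bottom-up search confirms a k-dimensional cluster
    only after visiting every lower-dimensional projection of it.
    """
    return sum(comb(k, i) for i in range(1, k + 1))


def pairwise_comparisons(k: int) -> int:
    """Count the unordered pairs among k items, i.e. k * (k - 1) / 2.

    Assumption: used here only as a stand-in for quadratic growth.
    """
    return comb(k, 2)


if __name__ == "__main__":
    # Print both counts side by side for a few cluster dimensionalities.
    for k in (5, 10, 20, 30):
        print(k, bottom_up_candidates(k), pairwise_comparisons(k))
```

Under these assumptions, the first count doubles with every added dimension while the second grows only with the square of k, which is why scalability with the dimensionality of the hidden clusters is the deciding factor.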
Acknowledgement
The research leading to these results has received funding from the European Union under the FP7 grant agreement no. 619633 (project ONTIC) and the H2020 grant agreement no. 671625 (project CogNet).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Zhu, B., Mozo, A. (2016). Spark2Fires: A New Parallel Approximate Subspace Clustering Algorithm. In: Ivanović, M., et al. New Trends in Databases and Information Systems. ADBIS 2016. Communications in Computer and Information Science, vol 637. Springer, Cham. https://doi.org/10.1007/978-3-319-44066-8_16
DOI: https://doi.org/10.1007/978-3-319-44066-8_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44065-1
Online ISBN: 978-3-319-44066-8
eBook Packages: Computer Science, Computer Science (R0)