Spark2Fires: A New Parallel Approximate Subspace Clustering Algorithm

Zhu, Bo; Mozo, Alberto

doi:10.1007/978-3-319-44066-8_16

Conference paper
First Online: 14 August 2016

Part of the book series: Communications in Computer and Information Science (CCIS, volume 637)

Abstract

Subspace clustering is a research field that has been intensively studied over the last two decades. The objective of subspace clustering is to find all lower-dimensional clusters hidden in subspaces of high-dimensional data. Although the majority of existing subspace clustering algorithms adopt heuristic pruning techniques to reduce the search space, the time complexity of such algorithms remains exponential with regard to the highest dimensionality of the hidden subspace clusters. Even with the help of parallelism, these techniques require extremely high computational time in practice. In this paper we propose a novel subspace clustering technique that reduces the exponential time complexity to quadratic via approximation. We also provide a parallel implementation of the proposed algorithm on top of Apache Spark to further accelerate our approach on large data sets. Preliminary experimental results show that our algorithm performs much better, especially in terms of scalability with regard to the dimensionality of hidden clusters.

Acknowledgement

The research leading to these results has received funding from the European Union under the FP7 grant agreement no. 619633 (project ONTIC) and the H2020 grant agreement no. 671625 (project CogNet).

Author information

Authors and Affiliations

Universidad Politécnica de Madrid, Madrid, Spain
Bo Zhu & Alberto Mozo

Corresponding author

Correspondence to Bo Zhu.

Editor information

Editors and Affiliations

Faculty of Sciences, University of Novi Sad, Novi Sad, Serbia
Mirjana Ivanović
Christian-Albrechts-Universität Kiel, Kiel, Germany
Bernhard Thalheim
University of Genoa, Genoa, Italy
Barbara Catania
Software Competence Center Hagenberg GmbH, Hagenberg, Austria
Klaus-Dieter Schewe
Riga Technical University, Riga, Latvia
Mārīte Kirikova
VSB-Technical University Ostrava, Ostrava, Czech Republic
Petr Šaloun
Georgia College and State University, Milledgeville, Georgia, USA
Ajantha Dahanayake
Politecnico di Torino, Torino, Italy
Tania Cerquitelli
Politecnico di Torino, Torino, Italy
Elena Baralis
EURECOM, Biot Sophia Antipolis cedex, France
Pietro Michiardi

About this paper

Cite this paper

Zhu, B., Mozo, A. (2016). Spark2Fires: A New Parallel Approximate Subspace Clustering Algorithm. In: Ivanović, M., et al. New Trends in Databases and Information Systems. ADBIS 2016. Communications in Computer and Information Science, vol 637. Springer, Cham. https://doi.org/10.1007/978-3-319-44066-8_16

DOI: https://doi.org/10.1007/978-3-319-44066-8_16
Published: 14 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44065-1
Online ISBN: 978-3-319-44066-8
eBook Packages: Computer Science, Computer Science (R0)
