Spark2Fires: A New Parallel Approximate Subspace Clustering Algorithm

  • Conference paper

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 637))

Abstract

Subspace clustering has been studied intensively over the last two decades. Its objective is to find all lower-dimensional clusters hidden in subspaces of high-dimensional data. Although most existing subspace clustering algorithms adopt heuristic pruning techniques to reduce the search space, their time complexity remains exponential in the highest dimensionality of the hidden subspace clusters. Even with the help of parallelism, these techniques require extremely high computational time in practice. In this paper we propose a novel subspace clustering technique that reduces the exponential time complexity to quadratic via approximation. We also provide a parallel implementation of the proposed algorithm on top of Apache Spark to further accelerate our approach on large data sets. Preliminary experimental results show that our algorithm performs much better, especially in terms of scalability with regard to the dimensionality of hidden clusters.
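To give a feel for the gap between exponential and quadratic growth mentioned in the abstract, the following is a small illustrative sketch, not the paper's algorithm. It assumes that a bottom-up method reaching a k-dimensional hidden cluster may have to visit every non-empty lower-dimensional projection of it, of which there are 2^k - 1. The quadratic count k(k-1)/2 (the number of dimension pairs) is only a stand-in for "quadratic in k" and is not taken from the paper. Both function names are hypothetical.

```python
from math import comb


def exponential_candidates(k: int) -> int:
    """Non-empty subspaces of a k-dimensional cluster: 2^k - 1.

    Assumption: a bottom-up search may have to examine each of them.
    """
    return 2 ** k - 1


def quadratic_candidates(k: int) -> int:
    """Unordered dimension pairs, k(k-1)/2.

    Assumption: used purely as an example of quadratic growth in k.
    """
    return comb(k, 2)


if __name__ == "__main__":
    # Compare the two growth rates for a few hidden-cluster dimensionalities.
    for k in (5, 10, 20, 30):
        print(k, exponential_candidates(k), quadratic_candidates(k))
```

Under these assumptions the exponential count passes one million at k = 20, while the quadratic count stays below 200, which is the kind of gap that makes approximation attractive for high-dimensional hidden clusters.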

Acknowledgement

The research leading to these results has received funding from the European Union under FP7 grant agreement no. 619633 (project ONTIC) and H2020 grant agreement no. 671625 (project CogNet).

Author information

Corresponding author

Correspondence to Bo Zhu.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhu, B., Mozo, A. (2016). Spark2Fires: A New Parallel Approximate Subspace Clustering Algorithm. In: Ivanović, M., et al. New Trends in Databases and Information Systems. ADBIS 2016. Communications in Computer and Information Science, vol 637. Springer, Cham. https://doi.org/10.1007/978-3-319-44066-8_16

  • DOI: https://doi.org/10.1007/978-3-319-44066-8_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-44065-1

  • Online ISBN: 978-3-319-44066-8

  • eBook Packages: Computer Science (R0)
