An efficient MapReduce algorithm for similarity join in metric spaces

The Journal of Supercomputing

Abstract

Given a massive set of records, a similarity join finds all pairs of records whose similarity score exceeds a threshold. In this paper, we address the problem of scaling up similarity joins for general metric distance functions using MapReduce. First, we propose a novel index structure, the Similarity Join Tree (SJT), which partitions data according to the underlying data distribution and assigns similar records to the same group. Unlike existing approaches, SJT can prune a large number of comparisons within reduce tasks by reusing by-product results generated while partitioning the data. Then, to avoid straggler reduce tasks, we design a graph partitioning algorithm that extends the well-known Fiduccia–Mattheyses algorithm to balance load while minimizing communication cost and redundancy across all reduce tasks. Experimental results on real data sets show that our approach is more effective and scalable than state-of-the-art algorithms.
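The pruning idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' SJT: it is a generic single-pivot filter based on the triangle inequality, with Euclidean distance standing in for an arbitrary metric and the join condition assumed to be "distance at most `eps`". The function names, the pivot choice, and the sample points are all hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, used here as one example of a metric


def naive_join(records, eps):
    """Baseline: compare every pair and keep those within distance eps."""
    return {(i, j) for i, j in combinations(range(len(records)), 2)
            if dist(records[i], records[j]) <= eps}


def pivot_join(records, eps, pivot):
    """Same result as naive_join, skipping pairs ruled out by a pivot.

    For any metric d, |d(x, p) - d(y, p)| <= d(x, y), so a pair whose
    pivot distances differ by more than eps cannot be within eps.
    """
    # Distances to the pivot: the kind of by-product a partitioning step
    # would typically have computed already (an assumption of this sketch).
    to_pivot = [dist(r, pivot) for r in records]
    order = sorted(range(len(records)), key=to_pivot.__getitem__)
    result, compared = set(), 0
    for a, i in enumerate(order):
        for j in order[a + 1:]:
            if to_pivot[j] - to_pivot[i] > eps:
                break  # every later record is even farther from the pivot
            compared += 1
            if dist(records[i], records[j]) <= eps:
                result.add((min(i, j), max(i, j)))
    return result, compared


points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.2, 5.1), (9.0, 0.0)]
pairs, compared = pivot_join(points, eps=1.0, pivot=(0.0, 0.0))
```

On these five sample points the filter performs 2 full distance computations instead of the 10 a brute-force join needs, and returns the same pairs. In a MapReduce setting such a filter would run inside each reduce task; how groups are formed and balanced across tasks is outside the scope of this sketch.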


Notes

  1. For the added virtual nodes in SJT\(_C\), the weight is set to 1.


Acknowledgments

This work is supported by NSFC under Grant 61173160 and by the Scientific Research Program of the Higher Education Institution of XinJiang (XJEDU2014S087).

Author information

Corresponding author

Correspondence to Yanming Shen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.


About this article


Cite this article

Liu, W., Shen, Y. & Wang, P. An efficient MapReduce algorithm for similarity join in metric spaces. J Supercomput 72, 1179–1200 (2016). https://doi.org/10.1007/s11227-016-1651-9


  • DOI: https://doi.org/10.1007/s11227-016-1651-9
