An efficient MapReduce algorithm for similarity join in metric spaces

Liu, Wen; Shen, Yanming; Wang, Peng

doi:10.1007/s11227-016-1651-9

Published: 06 February 2016

Volume 72, pages 1179–1200, (2016)
The Journal of Supercomputing

Wen Liu¹, Yanming Shen¹ & Peng Wang²

Abstract

Given a massive set of records, a similarity join finds all pairs of records whose similarity score exceeds a threshold. In this paper, we address the problem of scaling up similarity join for general metric distance functions using MapReduce. First, we propose a novel index structure, the Similarity Join Tree (SJT), which partitions data based on the underlying data distribution and assigns similar records to the same group. Unlike existing approaches, SJT can prune a large number of comparisons within reduce tasks by reusing the by-product results generated while partitioning the data. Then, to avoid straggler reduce tasks, we design a graph partitioning algorithm that extends the well-known Fiduccia–Mattheyses algorithm and can ensure load balancing while minimizing communication cost and redundancy across all reduce tasks. Experimental results on real data sets show that our approach is more effective and scalable than state-of-the-art algorithms.
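To make the idea of pruning comparisons in a metric space concrete, the following is a minimal single-machine sketch. It is not the SJT index or the MapReduce pipeline from the paper. It assumes Euclidean distance, a single pivot, and a join condition phrased as "distance at most `eps`" (the distance-based counterpart of a similarity threshold). The names `similarity_join`, `pivot`, and `eps` are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available since Python 3.8


def similarity_join(records, pivot, eps):
    """Return (pairs, compared): index pairs within distance eps, and the
    number of exact distance computations between records that were needed."""
    # Distance from every record to the pivot, computed once up front.
    to_pivot = [dist(r, pivot) for r in records]
    pairs = []
    compared = 0
    for i, j in combinations(range(len(records)), 2):
        # Triangle inequality: |d(a, p) - d(b, p)| <= d(a, b).
        # If this lower bound already exceeds eps, the pair cannot qualify,
        # so the exact distance is never computed.
        if abs(to_pivot[i] - to_pivot[j]) > eps:
            continue
        compared += 1
        if dist(records[i], records[j]) <= eps:
            pairs.append((i, j))
    return pairs, compared


if __name__ == "__main__":
    records = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.2, 10.1)]
    pairs, compared = similarity_join(records, pivot=(0.0, 0.0), eps=1.0)
    print(pairs, compared)
```

In this toy input, four of the six candidate pairs are discarded by the lower bound alone. The filter relies only on the triangle inequality, so it would carry over to any metric distance function, not just the Euclidean one assumed here.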

Notes

For the added virtual nodes in SJT\(_C\), the weight is set to 1.

Acknowledgments

This work is supported by NSFC under Grant 61173160 and by the Scientific Research Program of the Higher Education Institution of XinJiang (XJEDU2014S087).

Author information

Authors and Affiliations

School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian, 116024, Liaoning, People’s Republic of China
Wen Liu & Yanming Shen
School of Computer Science, Fudan University, Shanghai, People’s Republic of China
Peng Wang

Corresponding author

Correspondence to Yanming Shen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Liu, W., Shen, Y. & Wang, P. An efficient MapReduce algorithm for similarity join in metric spaces. J Supercomput 72, 1179–1200 (2016). https://doi.org/10.1007/s11227-016-1651-9

Issue Date: March 2016
DOI: https://doi.org/10.1007/s11227-016-1651-9
