Multi-pass sorted neighborhood blocking with MapReduce

Kolb, Lars; Thor, Andreas; Rahm, Erhard

doi:10.1007/s00450-011-0177-x

Special Issue Paper
Published: 18 May 2011

Volume 27, pages 45–63, (2012)
Computer Science - Research and Development

Abstract

Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate the challenges of, and possible solutions for, using the MapReduce programming model for parallel entity resolution with Sorted Neighborhood blocking (SN). We propose and evaluate two efficient MapReduce-based implementations for single- and multi-pass SN that either use multiple MapReduce jobs or apply a tailored data replication. We also propose an automatic data partitioning approach for multi-pass SN to achieve load balancing. Our evaluation based on real-world datasets shows the high efficiency and effectiveness of the proposed approaches.
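
For readers unfamiliar with the blocking technique named in the abstract, the following is a minimal single-machine sketch of sorted neighborhood blocking in its generic form: sort the records by a blocking key, slide a fixed-size window over the sorted list, and compare only records that fall into the same window. It is not the MapReduce implementation proposed in the article. The function names, the `window` parameter, and the idea of merging several passes by taking the union of their candidate pairs are assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, Sequence, Set, Tuple, TypeVar

T = TypeVar("T")


def sorted_neighborhood_pairs(
    records: Iterable[T], key: Callable[[T], str], window: int
) -> Iterator[Tuple[T, T]]:
    """Yield candidate pairs from one sorted neighborhood pass (illustrative)."""
    if window < 2:
        raise ValueError("window must be at least 2")
    ordered = sorted(records, key=key)  # sort by the blocking key
    for i, left in enumerate(ordered):
        # Pair each record with the next (window - 1) records in sort order.
        for right in ordered[i + 1 : i + window]:
            yield left, right


def multi_pass_pairs(
    records: Sequence[T], keys: Iterable[Callable[[T], str]], window: int
) -> Set[Tuple[int, int]]:
    """Union of candidate pairs over several blocking keys (hypothetical helper).

    Pairs are returned as (smaller index, larger index) so that a pair found
    in more than one pass is kept only once.
    """
    candidates: Set[Tuple[int, int]] = set()
    for key in keys:
        index_pairs = sorted_neighborhood_pairs(
            range(len(records)), lambda i: key(records[i]), window
        )
        for i, j in index_pairs:
            candidates.add((min(i, j), max(i, j)))
    return candidates
```

With a window of size w, each pass over n records produces on the order of n·(w − 1) candidate pairs instead of the n·(n − 1)/2 pairs of an exhaustive comparison. Running several passes with different keys lets records that one key sorts far apart still be compared under another key.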

Author information

Authors and Affiliations

Institut für Informatik, Fakultät für Mathematik und Informatik, Universität Leipzig, PF 100920, 04009, Leipzig, Germany
Lars Kolb, Andreas Thor & Erhard Rahm

Corresponding author

Correspondence to Lars Kolb.

About this article

Cite this article

Kolb, L., Thor, A. & Rahm, E. Multi-pass sorted neighborhood blocking with MapReduce. Comput Sci Res Dev 27, 45–63 (2012). https://doi.org/10.1007/s00450-011-0177-x

Issue Date: February 2012
