Abstract
Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate challenges and possible solutions for using the MapReduce programming model for parallel entity resolution with Sorted Neighborhood blocking (SN). We propose and evaluate two efficient MapReduce-based implementations for single- and multi-pass SN that either use multiple MapReduce jobs or apply a tailored data replication. We also propose an automatic data partitioning approach for multi-pass SN to achieve load balancing. Our evaluation based on real-world datasets shows the high efficiency and effectiveness of the proposed approaches.
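For readers unfamiliar with the blocking technique named above, the following is a minimal single-machine sketch of single- and multi-pass Sorted Neighborhood blocking. It is not the MapReduce implementation proposed in the article. The function names, the record layout (dictionaries with an "id" field), and the sample blocking keys are assumptions made for illustration only.

```python
def sorted_neighborhood_pairs(records, key, window):
    """Single-pass SN: sort by a blocking key, then pair each record
    with the next (window - 1) records in sorted order."""
    if window < 2:
        raise ValueError("window must be at least 2")
    ordered = sorted(records, key=key)
    pairs = []
    for i, left in enumerate(ordered):
        # Slicing past the end of the list is safe and simply yields fewer records.
        for right in ordered[i + 1 : i + window]:
            pairs.append((left, right))
    return pairs


def multi_pass_pairs(records, keys, window):
    """Multi-pass SN: repeat the single pass once per blocking key and
    return the union of candidate pairs as sets of record ids (assumed field "id")."""
    candidates = set()
    for key in keys:
        for left, right in sorted_neighborhood_pairs(records, key, window):
            candidates.add(frozenset((left["id"], right["id"])))
    return candidates


if __name__ == "__main__":
    # Hypothetical sample records, not taken from the article's datasets.
    sample = [
        {"id": 1, "name": "meier", "city": "leipzig"},
        {"id": 2, "name": "meyer", "city": "berlin"},
        {"id": 3, "name": "adams", "city": "leipzig"},
        {"id": 4, "name": "zander", "city": "berlin"},
    ]
    keys = [lambda r: r["name"], lambda r: r["city"]]
    for pair in sorted(sorted(p) for p in multi_pass_pairs(sample, keys, window=2)):
        print(pair)
```

Only the candidate pairs produced here would be passed on to a similarity function. A parallel version additionally has to handle window positions that span partition boundaries, which is the problem the approaches in the abstract address.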
Cite this article
Kolb, L., Thor, A. & Rahm, E. Multi-pass sorted neighborhood blocking with MapReduce. Comput Sci Res Dev 27, 45–63 (2012). https://doi.org/10.1007/s00450-011-0177-x