Skip to main content
Log in

Multi-pass sorted neighborhood blocking with MapReduce

  • Special Issue Paper
  • Published:
Computer Science - Research and Development

Abstract

Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate challenges and possible solutions of using the MapReduce programming model for parallel entity resolution using Sorting Neighborhood blocking (SN). We propose and evaluate two efficient MapReduce-based implementations for single- and multi-pass SN that either use multiple MapReduce jobs or apply a tailored data replication. We also propose an automatic data partitioning approach for multi-pass SN to achieve load balancing. Our evaluation based on real-world datasets shows the high efficiency and effectiveness of the proposed approaches.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Armbrust M, Fox A, Griffith R, Joseph AD, Katz RH, Konwinski A, Lee G, Patterson DA, Rabkin A, Stoica I, Zaharia M (2009) Above the clouds: A berkeley view of cloud computing. Tech rep, EECS Department. University of California, Berkeley

  2. Batini C, Scannapieco M (2006) Data quality: concepts, methodologies and techniques. Data-centric systems and applications. Springer, Berlin

    Google Scholar 

  3. Baxter R, Christen P, Churches T (2003) A comparison of fast blocking methods for record linkage. In: ACM SIGKDD, vol 3, pp 25–27

    Google Scholar 

  4. Bilenko M, Mooney RJ (2003) Adaptive duplicate detection using learnable string similarity measures. In: KDD, pp 39–48

    Google Scholar 

  5. Borthakur D (2007) The hadoop distributed file system: Architecture and design. Hadoop Project Website

  6. Christen P (2008) Febrl -: an open source data cleaning, deduplication and record linkage system with a graphical user interface. In: KDD, pp 1065–1068

    Google Scholar 

  7. Christen P, Churches T, Hegland M (2004) Febrl—a parallel open source data linkage system. In: PAKDD, pp 638–647

    Google Scholar 

  8. Dean J, Ghemawat S (2004) MapReduce: Simplified data processing on large clusters. In: OSDI, pp 137–150

    Google Scholar 

  9. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

    Article  Google Scholar 

  10. DeWitt D, Gray J (1992) Parallel database systems: the future of high performance database systems. Commun ACM 35(6):85–98

    Article  Google Scholar 

  11. DeWitt DJ, Naughton JF, Schneider DA, Seshadri S (1992) Practical skew handling in parallel joins. In: VLDB, pp 27–40

    Google Scholar 

  12. Elmagarmid AK, Ipeirotis PG, Verykios VS (2007) Duplicate record detection: a survey. IEEE Trans Knowl Data Eng 19(1):1–16

    Article  Google Scholar 

  13. Foundation AS (2006) Hadoop. http://hadoop.apache.org/mapreduce/

  14. Hernández MA, Stolfo SJ (1995) The merge/purge problem for large databases. In: SIGMOD Conference, pp 127–138

    Google Scholar 

  15. Kim HS, Lee D (2007) Parallel linkage. In: CIKM, pp 283–292

    Google Scholar 

  16. Kirsten T, Kolb L, Hartung M, Gross A, Köpcke H, Rahm E (2010) Data partitioning for parallel entity matching. In: 8th International Workshop on Quality in Databases

    Google Scholar 

  17. Kolb L, Thor A, Rahm E (2011) Parallel sorted neighborhood blocking with mapreduce. In: BTW, pp 45–64

    Google Scholar 

  18. Köpcke H, Rahm E (2010) Frameworks for entity matching: a comparison. Data Knowl Eng 69(2):197–210

    Article  Google Scholar 

  19. Köpcke H, Thor A, Rahm E (2010) Evaluation of entity resolution approaches on real-world match problems. In: VLDB, pp 484–493

    Google Scholar 

  20. Köpcke H, Thor A, Rahm E (2010) Learning-based approaches for matching web data entities. IEEE Internet Comput 14:23–31

    Article  Google Scholar 

  21. Lin J, Dyer C (2010) Data-intensive text processing with mapreduce. Synth Lect Hum Lang Technol 3(1):1–177

    Article  Google Scholar 

  22. Rahm E, Do HH (2000) Data cleaning: problems and current approaches. IEEE Data Eng Bull 23(4):3–13

    Google Scholar 

  23. Vernica R, Carey MJ, Li C (2010) Efficient parallel set-similarity joins using mapreduce. In: SIGMOD Conference, pp 495–506

    Chapter  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Lars Kolb.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Kolb, L., Thor, A. & Rahm, E. Multi-pass sorted neighborhood blocking with MapReduce. Comput Sci Res Dev 27, 45–63 (2012). https://doi.org/10.1007/s00450-011-0177-x

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00450-011-0177-x

Keywords

Navigation