research-article

Evaluation of entity resolution approaches on real-world match problems

Published: 01 September 2010

Abstract

Despite the large body of recent research on entity resolution (matching), there has not yet been a comparative evaluation of the relative effectiveness and efficiency of alternative approaches. We therefore present such an evaluation of existing implementations on challenging real-world match tasks. We consider approaches both with and without machine learning for finding a suitable parameterization and combination of similarity functions. In addition to approaches from the research community, we also consider a state-of-the-art commercial entity resolution implementation. Our results indicate significant differences in quality and efficiency between the approaches. We also find that some challenging resolution tasks, such as matching product entities from online shops, are not sufficiently solved by conventional approaches based on the similarity of attribute values.
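To make the phrase "parameterization and combination of similarity functions" concrete, the following is a minimal sketch of a non-learning, attribute-value matcher. It is not one of the implementations evaluated in the paper: the attribute names, the weights, the threshold, and the sample product records are all hypothetical and chosen only for illustration. A learning-based approach would fit the weights and threshold from labeled match pairs rather than fixing them by hand.

```python
from difflib import SequenceMatcher


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity of two strings (case-insensitive)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def edit_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] computed by difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical parameterization: which similarity function is applied to
# which attribute, and how much that attribute counts in the combination.
WEIGHTS = {
    "title": (jaccard, 0.7),
    "manufacturer": (edit_ratio, 0.3),
}
THRESHOLD = 0.6  # assumed decision threshold, not taken from the paper


def combined_similarity(r1: dict, r2: dict) -> float:
    """Weighted sum of the per-attribute similarities."""
    return sum(w * sim(r1[attr], r2[attr]) for attr, (sim, w) in WEIGHTS.items())


def is_match(r1: dict, r2: dict) -> bool:
    """Declare a match when the combined similarity reaches the threshold."""
    return combined_similarity(r1, r2) >= THRESHOLD


if __name__ == "__main__":
    # Made-up product offers, in the spirit of the online-shop match task.
    a = {"title": "Canon PowerShot A590 digital camera", "manufacturer": "Canon"}
    b = {"title": "Canon PowerShot A590 camera 8MP", "manufacturer": "Canon Inc."}
    c = {"title": "Nikon Coolpix L18 digital camera", "manufacturer": "Nikon"}
    print(is_match(a, b), is_match(a, c))
```

Product titles from different shops often share generic tokens and differ in the few tokens that identify the product, which is one plausible reason why a purely attribute-similarity rule like this one can struggle on such data.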



Published in

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages

Publisher

VLDB Endowment

