Evaluation of entity resolution approaches on real-world match problems

Authors:
Hanna Köpcke, University of Leipzig, Germany
Andreas Thor, University of Leipzig, Germany
Erhard Rahm, University of Leipzig, Germany

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2, pp. 484–493. https://doi.org/10.14778/1920841.1920904

Published: 01 September 2010

Abstract

Despite the huge amount of recent research efforts on entity resolution (matching) there has not yet been a comparative evaluation on the relative effectiveness and efficiency of alternate approaches. We therefore present such an evaluation of existing implementations on challenging real-world match tasks. We consider approaches both with and without using machine learning to find suitable parameterization and combination of similarity functions. In addition to approaches from the research community we also consider a state-of-the-art commercial entity resolution implementation. Our results indicate significant quality and efficiency differences between different approaches. We also find that some challenging resolution tasks such as matching product entities from online shops are not sufficiently solved with conventional approaches based on the similarity of attribute values.
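
To make the idea of a parameterized combination of similarity functions concrete, here is a minimal sketch of a non-learning matcher that compares attribute values. It is not the implementation evaluated in the paper: the trigram measure, the attribute names (`title`, `manufacturer`), the weights, and the threshold are all illustrative assumptions. A learning-based approach would instead derive the weights and threshold from labeled example pairs.

```python
def trigrams(text):
    """Return the set of character trigrams of a lower-cased, padded string."""
    padded = f"  {text.lower().strip()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of the two trigram sets, a value in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def match_score(rec_a, rec_b, weights):
    """Weighted average of per-attribute similarities (weights are assumed, not learned)."""
    total = sum(weights.values())
    weighted = sum(
        w * trigram_similarity(rec_a[attr], rec_b[attr])
        for attr, w in weights.items()
    )
    return weighted / total


def is_match(rec_a, rec_b, weights, threshold):
    """Declare a match when the combined score reaches the (hand-picked) threshold."""
    return match_score(rec_a, rec_b, weights) >= threshold


# Hypothetical product records and parameters, chosen only for illustration.
WEIGHTS = {"title": 2, "manufacturer": 1}
offer_a = {"title": "PowerShot A590 IS digital camera", "manufacturer": "Canon"}
offer_b = {"title": "Canon PowerShot A590IS camera", "manufacturer": "Canon Inc."}

if __name__ == "__main__":
    print(match_score(offer_a, offer_b, WEIGHTS))
    print(is_match(offer_a, offer_b, WEIGHTS, threshold=0.5))
```

The weights and the threshold are exactly the kind of parameters that must be tuned per match task, either by hand or from training data.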

Published in: Proceedings of the VLDB Endowment, Volume 3, Issue 1-2, September 2010, 1658 pages. ISSN: 2150-8097
Publisher: VLDB Endowment