Abstract
Despite the large body of recent research on entity resolution (matching), there has not yet been a comparative evaluation of the relative effectiveness and efficiency of alternative approaches. We therefore present such an evaluation of existing implementations on challenging real-world match tasks. We consider approaches both with and without machine learning for finding a suitable parameterization and combination of similarity functions. In addition to approaches from the research community, we also consider a state-of-the-art commercial entity resolution implementation. Our results indicate significant quality and efficiency differences between the approaches. We also find that some challenging resolution tasks, such as matching product entities from online shops, are not sufficiently solved by conventional approaches based on the similarity of attribute values.