Abstract
Entity resolution is the process of determining whether, in a specific context, two or more references correspond to the same entity. In this work, we address this problem for references to persons as they are found in bibliographic data, specifically when consolidating multiple datasets. Our solution follows the extraction, transformation and loading (ETL) process typical of data warehouses. It computes the similarities of the attribute values of the references and employs a decision tree to decide when the references match. We describe the characteristics of these references within bibliographic datasets, and how we exploited those characteristics by developing new similarity metrics to improve the quality of the consolidation process. We evaluated our work in an experiment with data from four national libraries. The results show that the proposed similarity metrics contribute significantly to the consolidation process.
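The matching step described above (attribute similarities fed into a tree of decisions) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `PersonRef` record, the choice of a normalized Levenshtein similarity, the `birth_year` attribute, and the thresholds 0.8 and 0.95 are hypothetical. The paper's own similarity metrics and its trained decision tree are not reproduced here; the hand-written rules only stand in for a learned tree.

```python
from dataclasses import dataclass
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical after case folding."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


@dataclass
class PersonRef:
    """Hypothetical person reference with two illustrative attributes."""
    name: str
    birth_year: Optional[int] = None


def is_match(x: PersonRef, y: PersonRef) -> bool:
    """Hand-written stand-in for a learned decision tree (thresholds are assumptions)."""
    sim = name_similarity(x.name, y.name)
    if sim < 0.8:
        return False                         # names too different
    if x.birth_year is not None and y.birth_year is not None:
        return x.birth_year == y.birth_year  # dates available: they must agree
    return sim >= 0.95                       # no dates: demand a near-exact name
```

In this sketch, two references with the same name and the same birth year match, while the same name with conflicting birth years does not. In a learned tree, both the tests and the thresholds would come from labelled training pairs rather than being fixed by hand.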
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Freire, N., Borbinha, J., Martins, B. (2008). Consolidation of References to Persons in Bibliographic Databases. In: Buchanan, G., Masoodian, M., Cunningham, S.J. (eds) Digital Libraries: Universal and Ubiquitous Access to Information. ICADL 2008. Lecture Notes in Computer Science, vol 5362. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-89533-6_26
DOI: https://doi.org/10.1007/978-3-540-89533-6_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-89532-9
Online ISBN: 978-3-540-89533-6
eBook Packages: Computer Science, Computer Science (R0)