A Comparison of String Similarity Measures for Toponym Matching

COMP '13: Proceedings of The First ACM SIGSPATIAL International Workshop on Computational Models of Place, November 2013, Pages 54–61. https://doi.org/10.1145/2534848.2534850

Published: 03 October 2013

ABSTRACT

The diversity of ways in which toponyms are specified often results in mismatches between queries and the place names contained in gazetteers. Search terms that include unofficial variants of official place names, unanticipated transliterations, and typos are frequently similar but not identical to the place names contained in the gazetteer. String similarity measures can mitigate this problem, but given their task-dependent performance, the optimal choice of measure is unclear. We constructed a task in which place names had to be matched to variants of those names listed in the GEOnet Names Server, comparing 21 different measures on datasets containing romanized toponyms from 11 different countries. Best-performing measures varied widely across datasets, but were highly consistent within-country and within-language. We discuss which measures worked best for particular languages and provide recommendations for selecting appropriate string similarity measures.
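The matching task described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the two measures used here (a normalized Levenshtein edit similarity and a character-bigram Dice coefficient) are illustrative choices that are not confirmed to be among the 21 measures compared, the function names are hypothetical, and the place names are made-up test inputs rather than entries drawn from the GEOnet Names Server datasets.

```python
from typing import Callable, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def bigram_dice(a: str, b: str) -> float:
    """Dice coefficient over the sets of adjacent character pairs."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a and not grams_b:
        # Strings shorter than two characters have no bigrams to compare.
        return 1.0 if a == b else 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def rank(query: str, names: List[str],
         measure: Callable[[str, str], float]) -> List[Tuple[float, str]]:
    """Score every candidate name against the query, best match first."""
    q = query.casefold()
    scored = [(measure(q, name.casefold()), name) for name in names]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical gazetteer entries and query, for illustration only.
gazetteer = ["Lisburn", "Lesbos", "Lisboa"]
for measure in (edit_similarity, bigram_dice):
    print(measure.__name__, rank("Lisbon", gazetteer, measure))
```

Both measures happen to rank the same candidate first on this toy input. The abstract's point is that such agreement cannot be assumed: the best-performing measure varied across datasets, so a measure would need to be evaluated on names from the country or language of interest.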

Published in

COMP '13: Proceedings of The First ACM SIGSPATIAL International Workshop on Computational Models of Place
November 2013
75 pages
ISBN: 9781450325356
DOI: 10.1145/2534848

Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data integration
duplicate detection
edit distance
gazetteers
geographic information retrieval
string similarity
toponyms
Qualifiers
- tutorial
- Research
- Refereed limited
Acceptance Rates
COMP '13 Paper Acceptance Rate: 8 of 14 submissions, 57%
Overall Acceptance Rate: 8 of 14 submissions, 57%

Article Metrics
- Total Citations: 15
- Total Downloads: 391
- Downloads (Last 12 months): 35
- Downloads (Last 6 weeks): 3