ABSTRACT
The diversity of ways in which toponyms are specified often results in mismatches between queries and the place names contained in gazetteers. Search terms that include unofficial variants of official place names, unanticipated transliterations, and typos are frequently similar but not identical to the place names contained in the gazetteer. String similarity measures can mitigate this problem, but given their task-dependent performance, the optimal choice of measure is unclear. We constructed a task in which place names had to be matched to variants of those names listed in the GEOnet Names Server, comparing 21 different measures on datasets containing romanized toponyms from 11 different countries. Best-performing measures varied widely across datasets, but were highly consistent within-country and within-language. We discuss which measures worked best for particular languages and provide recommendations for selecting appropriate string similarity measures.
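As an illustration of the general idea only, and not of the study's own implementation, the sketch below ranks gazetteer entries against a query using one common character-level measure, a length-normalized Levenshtein edit distance. The function names and the toy place names are hypothetical, and lowercasing the inputs is an assumed preprocessing step.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(query: str, gazetteer: Iterable[str]) -> str:
    """Return the gazetteer entry most similar to the query (case-insensitive)."""
    return max(gazetteer, key=lambda name: similarity(query.lower(), name.lower()))


if __name__ == "__main__":
    # Hypothetical toy gazetteer; a real one would hold many thousands of names.
    names = ["Lisbon", "Lyon", "London"]
    print(best_match("Lisbonn", names))
```

Because performance is task-dependent, a measure like this one would be a starting point to evaluate against alternatives on the target dataset rather than a default choice.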