A Comparison of String Similarity Measures for Toponym Matching

DOI: 10.1145/2534848.2534850
Published: 03 October 2013

ABSTRACT

The diversity of ways in which toponyms are specified often results in mismatches between queries and the place names contained in gazetteers. Search terms that include unofficial variants of official place names, unanticipated transliterations, and typos are frequently similar but not identical to the place names contained in the gazetteer. String similarity measures can mitigate this problem, but given their task-dependent performance, the optimal choice of measure is unclear. We constructed a task in which place names had to be matched to variants of those names listed in the GEOnet Names Server, comparing 21 different measures on datasets containing romanized toponyms from 11 different countries. Best-performing measures varied widely across datasets, but were highly consistent within-country and within-language. We discuss which measures worked best for particular languages and provide recommendations for selecting appropriate string similarity measures.
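
To make the matching task concrete, the following is a minimal sketch of how one string similarity measure can rank gazetteer entries against a query. It assumes normalized Levenshtein edit distance as the measure, chosen only because it is widely known; the abstract does not say which of the 21 measures performed best. The gazetteer contents and the function names `levenshtein`, `similarity`, and `best_match` are hypothetical and not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def best_match(query: str, names: list[str]) -> str:
    """Return the gazetteer name most similar to the query."""
    return max(names, key=lambda name: similarity(query, name))


# Hypothetical toy gazetteer; real gazetteers hold millions of names.
gazetteer = ["Lisboa", "Lisbon", "Lyon", "Leipzig"]
print(best_match("Lisbonne", gazetteer))  # prints "Lisbon"
```

Because the abstract reports that the best-performing measure depends on country and language, a practical setup would likely keep `similarity` swappable rather than fixing a single measure.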

Published in

COMP '13: Proceedings of The First ACM SIGSPATIAL International Workshop on Computational Models of Place
November 2013
75 pages
ISBN: 9781450325356
DOI: 10.1145/2534848

Copyright © 2013 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Qualifiers: tutorial, Research, Refereed limited

Acceptance Rates

COMP '13 Paper Acceptance Rate: 8 of 14 submissions, 57%
Overall Acceptance Rate: 8 of 14 submissions, 57%
