A Comparison of String Similarity Measures for Toponym Matching

  • Authors:
  • Affiliations:
  • Venue: Proceedings of The First ACM SIGSPATIAL International Workshop on Computational Models of Place
  • Year: 2013

Abstract

The diversity of ways in which toponyms are specified often results in mismatches between queries and the place names contained in gazetteers. Search terms that include unofficial variants of official place names, unanticipated transliterations, and typos are frequently similar but not identical to the place names contained in the gazetteer. String similarity measures can mitigate this problem, but given their task-dependent performance, the optimal choice of measure is unclear. We constructed a task in which place names had to be matched to variants of those names listed in the GEOnet Names Server, comparing 21 different measures on datasets containing romanized toponyms from 11 different countries. Best-performing measures varied widely across datasets, but were highly consistent within-country and within-language. We discuss which measures worked best for particular languages and provide recommendations for selecting appropriate string similarity measures.
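To make the matching task concrete, the following is a minimal sketch of how a query toponym might be matched against gazetteer entries with a single string similarity measure. It uses normalized Levenshtein similarity purely as an illustration; the abstract does not name the 21 measures compared, so this choice is an assumption rather than the authors' method. The function names (`levenshtein`, `similarity`, `best_match`) and the toy gazetteer are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical after case folding."""
    a, b = a.casefold(), b.casefold()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(query: str, gazetteer: list[str]) -> str:
    """Return the gazetteer entry most similar to the query."""
    return max(gazetteer, key=lambda name: similarity(query, name))


# Hypothetical toy gazetteer, not drawn from the GEOnet Names Server.
toy_gazetteer = ["Lisboa", "Porto", "Braga"]
match = best_match("Lisbon", toy_gazetteer)
```

In this sketch the query "Lisbon" differs from "Lisboa" by one substitution, so "Lisboa" is returned. Since the abstract reports that the best-performing measure varied by country and language, an edit-distance measure like this one should be read as one candidate among many, not as a recommended default.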