A Comparison of String Similarity Measures for Toponym Matching

Authors:
Affiliations:
Venue: Proceedings of The First ACM SIGSPATIAL International Workshop on Computational Models of Place
Year: 2013

Abstract

The diversity of ways in which toponyms are specified often results in mismatches between queries and the place names contained in gazetteers. Search terms that include unofficial variants of official place names, unanticipated transliterations, and typos are frequently similar but not identical to the place names contained in the gazetteer. String similarity measures can mitigate this problem, but because their performance is task-dependent, the optimal choice of measure is unclear. We constructed a task in which place names had to be matched to variants of those names listed in the GEOnet Names Server, comparing 21 different measures on datasets containing romanized toponyms from 11 different countries. The best-performing measures varied widely across datasets, but were highly consistent within each country and within each language. We discuss which measures worked best for particular languages and provide recommendations for selecting appropriate string similarity measures.
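As a rough illustration of the matching task described in the abstract, the sketch below scores a query name against a few candidate names with three common string similarity measures and returns the best-scoring candidate. It is a minimal sketch under stated assumptions, not the authors' experimental setup: the three measures are generic examples and are not claimed to be among the 21 evaluated, the helper names (`levenshtein_similarity`, `bigram_dice`, `ratcliff_obershelp`, `best_match`) are hypothetical, and the place names are made-up inputs rather than GEOnet Names Server records.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def bigram_dice(a: str, b: str) -> float:
    """Dice coefficient over the sets of character bigrams."""
    def bigrams(s: str) -> set:
        return {s[i:i + 2] for i in range(len(s) - 1)}

    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Strings too short to have bigrams: fall back to exact equality.
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def ratcliff_obershelp(a: str, b: str) -> float:
    """Similarity ratio from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def best_match(query: str, candidates: list, measure) -> str:
    """Return the candidate that scores highest against the query."""
    return max(candidates, key=lambda c: measure(query.lower(), c.lower()))


if __name__ == "__main__":
    # Illustrative inputs only (assumption): not data from the paper.
    candidates = ["Moscow", "Minsk", "Madrid"]
    for measure in (levenshtein_similarity, bigram_dice, ratcliff_obershelp):
        print(measure.__name__, best_match("Moskva", candidates, measure))
```

Passing the measure as a function argument keeps the matching loop fixed while the measure varies, which is the kind of comparison the abstract describes: the same queries and candidates, scored under different measures.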