The diversity of ways in which toponyms are specified often results in mismatches between queries and the place names contained in gazetteers. Search terms that include unofficial variants of official place names, unanticipated transliterations, and typos are frequently similar but not identical to the place names contained in the gazetteer. String similarity measures can mitigate this problem, but given their task-dependent performance, the optimal choice of measure is unclear. We constructed a task in which place names had to be matched to variants of those names listed in the GEOnet Names Server, comparing 21 different measures on datasets containing romanized toponyms from 11 different countries. Best-performing measures varied widely across datasets, but were highly consistent within-country and within-language. We discuss which measures worked best for particular languages and provide recommendations for selecting appropriate string similarity measures.
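To make the matching task concrete, the following is a minimal sketch of how a query could be matched against a list of candidate place names using one string similarity measure. It assumes a length-normalized Levenshtein edit distance, which is only one common measure and is not necessarily among the 21 compared here, nor necessarily a recommended one. The function names (`levenshtein`, `similarity`, `best_match`) and the sample name list are hypothetical and chosen purely for illustration.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # Keep only two rows of the dynamic-programming table.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize edit distance to [0, 1], where 1.0 means identical.
    Case is ignored; this normalization is an illustrative assumption."""
    a, b = a.casefold(), b.casefold()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(query: str, names: Iterable[str]) -> str:
    """Return the candidate name most similar to the query."""
    return max(names, key=lambda name: similarity(query, name))


# Hypothetical candidate list standing in for gazetteer entries.
candidates = ["Kyoto", "Kobe", "Tokyo"]
print(best_match("Kioto", candidates))
```

Because performance of a measure depends on the task and the language, a sketch like this is best treated as a harness: `similarity` can be swapped for another measure and the results compared on names from the country or language of interest.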