This paper presents a novel approach for detecting duplicate records in digital gazetteers, using state-of-the-art machine learning techniques. It reports a thorough evaluation of alternative classifiers for deciding whether a pair of gazetteer records is a duplicate or not. The classifiers are built with support vector machines or alternating decision trees, using different combinations of similarity scores as feature vectors. Experimental results show that feature vectors combining multiple similarity scores, derived from place names, semantic relationships, place types and geospatial footprints, lead to an increase in accuracy. The paper also discusses how the proposed duplicate detection approach can scale to large collections through the use of filtering or blocking techniques.
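As a rough illustration of the pairwise setup described in the abstract, the sketch below builds a small feature vector of similarity scores for a pair of gazetteer records and uses a simple grid-based blocking step to limit which pairs are compared. Everything in it is an assumption made for illustration: the record fields (`name`, `type`, `lat`, `lon`), the function names, the choice of `difflib` ratio as the name similarity, the inverse-distance footprint score, and the one-degree grid cells are not taken from the paper, which may use different measures and blocking schemes. The resulting vectors would be passed to a classifier such as a support vector machine, which is omitted here.

```python
import math
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


def pair_features(r1, r2):
    """Feature vector for one record pair: name, place type, footprint."""
    distance = haversine_km(r1["lat"], r1["lon"], r2["lat"], r2["lon"])
    return [
        name_similarity(r1["name"], r2["name"]),
        1.0 if r1["type"] == r2["type"] else 0.0,  # exact place-type match
        1.0 / (1.0 + distance),                    # closer points score higher
    ]


def candidate_pairs(records, cell_deg=1.0):
    """Blocking: only pair records that fall in the same lat/lon grid cell.

    Assumption: a plain grid is used here for simplicity. It misses
    duplicates that straddle a cell boundary.
    """
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        key = (math.floor(record["lat"] / cell_deg),
               math.floor(record["lon"] / cell_deg))
        blocks[key].append(index)
    pairs = []
    for indices in blocks.values():
        pairs.extend(combinations(indices, 2))
    return pairs


# Hypothetical records, made up for this example.
records = [
    {"name": "Lisbon", "type": "city", "lat": 38.72, "lon": -9.14},
    {"name": "Lisboa", "type": "city", "lat": 38.71, "lon": -9.14},
    {"name": "Porto", "type": "city", "lat": 41.15, "lon": -8.61},
]

pairs = candidate_pairs(records)
features = [pair_features(records[i], records[j]) for i, j in pairs]
```

With these three records, blocking leaves only the Lisbon/Lisboa pair to be scored, so the Porto record is never compared against the other two. Combining several scores in one vector, rather than thresholding the name similarity alone, is the design point the abstract credits with the accuracy gain.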