A supervised machine learning approach for duplicate detection over gazetteer records

  • Authors: Bruno Martins
  • Affiliations: Instituto Superior Técnico, INESC-ID, Porto Salvo, Portugal
  • Venue: GeoS'11: Proceedings of the 4th International Conference on GeoSpatial Semantics
  • Year: 2011

Abstract

This paper presents a novel approach for detecting duplicate records in digital gazetteers, using state-of-the-art machine learning techniques. It reports a thorough evaluation of alternative machine learning approaches for classifying pairs of gazetteer records as either duplicates or non-duplicates, built with support vector machines or alternating decision trees and using different combinations of similarity scores as feature vectors. Experimental results show that feature vectors combining multiple similarity scores, derived from place names, semantic relationships, place types and geospatial footprints, lead to an increase in accuracy. The paper also discusses how the proposed duplicate detection approach can scale to large collections through the use of filtering or blocking techniques.
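
As a rough illustration of the pairwise setup described in the abstract, the following Python sketch builds a similarity feature vector for a pair of gazetteer records and uses a simple blocking step to limit the pairs that are compared. It is not the paper's implementation: the record fields, the three features (name similarity, place type match, geographic proximity), the grid-cell blocking key and all function names are assumptions made for this example. Semantic-relationship features and the classifier itself (a support vector machine or alternating decision tree trained on labelled pairs) are left out.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
from math import asin, cos, floor, radians, sin, sqrt

# Hypothetical gazetteer records; field names are assumptions for this sketch.
RECORDS = [
    {"id": 1, "name": "Lisboa", "type": "city", "lat": 38.7223, "lon": -9.1393},
    {"id": 2, "name": "Lisbon", "type": "city", "lat": 38.7169, "lon": -9.1399},
    {"id": 3, "name": "Porto", "type": "city", "lat": 41.1579, "lon": -8.6291},
]


def name_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def feature_vector(r1, r2):
    """Combine several similarity scores into one vector for a record pair."""
    dist = haversine_km(r1["lat"], r1["lon"], r2["lat"], r2["lon"])
    return [
        name_similarity(r1["name"], r2["name"]),   # place name similarity
        1.0 if r1["type"] == r2["type"] else 0.0,  # place type match
        1.0 / (1.0 + dist),                        # proximity, 1.0 at 0 km
    ]


def blocking_key(record, cell_deg=1.0):
    """Assign a record to a latitude/longitude grid cell (assumed scheme)."""
    return (floor(record["lat"] / cell_deg), floor(record["lon"] / cell_deg))


def candidate_pairs(records):
    """Yield only pairs that share a block, instead of all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


if __name__ == "__main__":
    for r1, r2 in candidate_pairs(RECORDS):
        print(r1["id"], r2["id"], feature_vector(r1, r2))
```

In a supervised setting, vectors like these would be computed for labelled duplicate and non-duplicate pairs and passed to the chosen classifier. Grid-cell blocking is only one possible filter, and it can miss duplicates that fall on opposite sides of a cell boundary.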