Learning blocking schemes for record linkage

  • Authors:
  • Matthew Michelson;Craig A. Knoblock

  • Affiliations:
  • University of Southern California, Information Sciences Institute, Marina del Rey, CA;University of Southern California, Information Sciences Institute, Marina del Rey, CA

  • Venue:
  • AAAI'06 Proceedings of the 21st national conference on Artificial intelligence - Volume 1
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

Record linkage is the process of matching records across data sets that refer to the same entity. One issue within record linkage is determining which record pairs to consider, since a detailed comparison between all of the records is impractical. Blocking addresses this issue by generating candidate matches as a preprocessing step for record linkage. For example, in a person matching problem, blocking might return all people with the same last name as candidate matches. Two main problems in blocking are the selection of attributes for generating the candidate matches and deciding which methods to use to compare the selected attributes. These attribute and method choices constitute a blocking scheme. Previous approaches to record linkage address the blocking issue in a largely ad-hoc fashion. This paper presents a machine learning approach to automatically learn effective blocking schemes. We validate our approach with experiments that show our learned blocking schemes outperform the ad-hoc blocking schemes of non-experts and perform comparably to those manually built by a domain expert.