Efficient Record Linkage in Large Data Sets

  • Authors:
  • Liang Jin;Chen Li;Sharad Mehrotra

  • Venue:
  • DASFAA '03 Proceedings of the Eighth International Conference on Database Systems for Advanced Applications
  • Year:
  • 2003

Abstract

This paper describes an efficient approach to record linkage. Given two lists of records, the record-linkage problem consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined based on domain-specific similarities over individual attributes constituting the record. The record-linkage problem arises naturally in the context of data cleansing that usually precedes data analysis and mining. We explore a novel approach to this problem. For each attribute of records, we first map values to a multidimensional Euclidean space that preserves domain-specific similarity. Many mapping algorithms can be applied, and we use the FastMap approach as an example. Given the merging rule that defines when two records are similar, a set of attributes are chosen along which the merge will proceed. A multidimensional similarity join over the chosen attributes is used to determine similar pairs of records. Our extensive experiments using real data sets show that our solution has very good efficiency and accuracy.
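
The two-step idea in the abstract (map attribute values into a Euclidean space, then join on the mapped points) can be illustrated with a minimal sketch. It is not the authors' implementation. Several things here are assumptions made for illustration: edit distance stands in for the domain-specific similarity, only a single string attribute is linked, a nested-loop scan replaces a real multidimensional similarity join, and the names edit_distance, fastmap, link_records, and radius are hypothetical. Because edit distance is not Euclidean, the mapping only approximates it, so a filter in the mapped space can miss true matches unless radius is chosen generously.

```python
import math
from itertools import product


def edit_distance(s, t):
    """Levenshtein distance (assumed here as the domain-specific distance)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]


def fastmap(objects, dist, dims):
    """Map objects to points in `dims` dimensions, FastMap-style."""
    n = len(objects)
    coords = [[0.0] * dims for _ in range(n)]

    def residual_sq(i, j, k):
        # Squared distance left over after projecting out the first k axes.
        d2 = dist(objects[i], objects[j]) ** 2
        for h in range(k):
            d2 -= (coords[i][h] - coords[j][h]) ** 2
        return max(d2, 0.0)  # clamp: non-Euclidean inputs can go negative

    for k in range(dims):
        # Pivot heuristic: start anywhere, jump to the farthest object twice.
        a = 0
        b = max(range(n), key=lambda j: residual_sq(a, j, k))
        a = max(range(n), key=lambda j: residual_sq(b, j, k))
        dab2 = residual_sq(a, b, k)
        if dab2 == 0.0:
            break  # all remaining residual distances are zero
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine-law projection of object i onto the line through a and b.
            coords[i][k] = (residual_sq(a, i, k) + dab2
                            - residual_sq(b, i, k)) / (2 * dab)
    return coords


def link_records(left, right, max_edits, radius, dims=4):
    """Return index pairs (i, j) with edit_distance(left[i], right[j]) <= max_edits.

    Pairs are first filtered by Euclidean distance <= radius in the mapped
    space, then verified with the real distance.
    """
    coords = fastmap(left + right, edit_distance, dims)
    lc, rc = coords[:len(left)], coords[len(left):]
    pairs = []
    for (i, u), (j, v) in product(enumerate(lc), enumerate(rc)):
        if math.dist(u, v) <= radius:                        # cheap filter
            if edit_distance(left[i], right[j]) <= max_edits:  # verification
                pairs.append((i, j))
    return pairs
```

For example, link_records(["jon smith", "mary jones"], ["john smith", "marie jones", "peter pan"], max_edits=2, radius=float("inf")) disables the filter and returns every pair within two edits. A finite radius trades some recall for fewer verifications, and every pair that is returned has still been checked against the real distance.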