Leveraging aggregate constraints for deduplication

  • Authors:
  • Surajit Chaudhuri;Anish Das Sarma;Venkatesh Ganti;Raghav Kaushik

  • Affiliations:
  • Microsoft Research, Redmond, WA;Stanford University, Stanford, CA;Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA

  • Venue:
  • Proceedings of the 2007 ACM SIGMOD international conference on Management of data
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

We show that aggregate constraints (as opposed to pairwise constraints) that often arise when integrating multiple sources of data, can be leveraged to enhance the quality of deduplication. However, despite its appeal, we show that the problem is challenging, both semantically and computationally. We define a restricted search space for deduplication that is intuitive in our context and we solve the problem optimally for the restricted space. Our experiments on real data show that incorporating aggregate constraints significantly enhances the accuracy of deduplication.