Efficient exact set-similarity joins

  • Authors:
  • Arvind Arasu;Venkatesh Ganti;Raghav Kaushik

  • Affiliations:
  • Microsoft Research, One Microsoft Way, Redmond, WA;Microsoft Research, One Microsoft Way, Redmond, WA;Microsoft Research, One Microsoft Way, Redmond, WA

  • Venue:
  • VLDB '06 Proceedings of the 32nd international conference on Very large data bases
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

Given two input collections of sets, a set-similarity join (SSJoin) identifies all pairs of sets, one from each collection, that have high similarity. Recent work has identified SSJoin as a useful primitive operator in data cleaning. In this paper, we propose new algorithms for SSJoin. Our algorithms have two important features: They are exact, i.e., they always produce the correct answer, and they carry precise performance guarantees. We believe our algorithms are the first to have both features; previous algorithms with performance guarantees are only probabilistically approximate. We demonstrate the effectiveness of our algorithms using a thorough experimental evaluation over real-life and synthetic data sets.