Similarity joins as stronger metric operations

  • Authors: Wei Wang

  • Affiliations: University of New South Wales, Australia

  • Venue: SIGSPATIAL Special
  • Year: 2010

Abstract

A similarity join between two sets of records returns the pairs of records whose similarity is no less than a given threshold. More precisely, given two sets of records R and S, a similarity function sim(·,·), and a threshold t, the similarity join between R and S is defined as { (r, s) | (r, s) ∈ R × S, sim(r, s) ≥ t }. A similarity join generalizes the traditional equality join commonly found in database systems. A variant replaces the similarity threshold with a distance threshold. The similarity threshold is generally expected to be close to the maximum possible value (usually 1.0), and the distance threshold close to the minimum possible value (usually 0). For example, we may find near-duplicate documents in a document repository using a cosine similarity threshold of 0.9, or find pairs of misspelled queries and their correct versions in a query log using an edit distance threshold of 2.
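
The definition above can be made concrete with a minimal Python sketch. It is a naive nested-loop illustration of the set definition only, not an efficient join algorithm and not the author's method. The function names (`similarity_join`, `distance_join`, `jaccard`, `edit_distance`) are hypothetical, the use of Jaccard similarity over whitespace tokens is an illustrative choice that does not come from the abstract, and the distance variant is assumed to keep pairs whose distance is at most the threshold.

```python
from itertools import product


def similarity_join(R, S, sim, t):
    """Return every (r, s) in R x S with sim(r, s) >= t.

    Naive nested loop over the cross product: O(|R| * |S|) calls to sim.
    """
    return [(r, s) for r, s in product(R, S) if sim(r, s) >= t]


def distance_join(R, S, dist, d):
    """Distance variant (assumed form): keep pairs with dist(r, s) <= d."""
    return [(r, s) for r, s in product(R, S) if dist(r, s) <= d]


def jaccard(a, b):
    """Jaccard similarity of the whitespace-token sets of two strings.

    Illustrative similarity function; any sim(.,.) could be passed instead.
    """
    x, y = set(a.split()), set(b.split())
    if not x and not y:
        return 1.0  # convention: two empty token sets are identical
    return len(x & y) / len(x | y)


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program, row by row."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# Made-up example data mirroring the query-log scenario with threshold 2.
queries = ["similarty join", "databse"]
dictionary = ["similarity join", "database", "metric space"]
matches = distance_join(queries, dictionary, edit_distance, 2)
```

With these inputs, `matches` pairs each misspelled query with its corrected form and excludes the unrelated entry. Passing exact equality as the similarity function, for example `lambda r, s: 1.0 if r == s else 0.0` with t = 1.0, reduces `similarity_join` to an ordinary equality join, which is the sense in which the similarity join generalizes it.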