Estimating set intersection using small samples

  • Authors:
  • Henning Köhler

  • Affiliations:
  • The University of Queensland, Brisbane, Australia

  • Venue:
  • ACSC '10 Proceedings of the Thirty-Third Australasian Conference on Computer Science - Volume 102
  • Year:
  • 2010

Abstract

The similarity of two sets A and B can be measured by the size of their intersection A ∩ B relative to the sizes of A and B. The classic measure is resemblance, or Jaccard similarity, but other useful measures (e.g. subset containment) can be derived from intersection size as well. For large sets, or for many sets, computing intersection size exactly can be expensive, and it requires transmitting entire sets when they are distributed. For this reason, a number of sampling techniques have been developed that allow intersection size (and derived measures) to be estimated efficiently from the intersection size of smaller sample sets. However, while existing estimation formulas are intuitive and unbiased, they can be quite inaccurate when samples are small. We show that more advanced estimation techniques can significantly reduce sample sizes without compromising accuracy or, conversely, yield more accurate results from the same samples.
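
To make the setting concrete, here is a minimal Python sketch of the kind of intuitive, unbiased baseline the abstract refers to. It is not the improved estimator proposed in the paper. It assumes one particular sampling scheme chosen for illustration (hash-coordinated sampling with a shared rate p), and all function names (unit_hash, sample, estimate_intersection, jaccard, containment) are hypothetical.

```python
import hashlib


def unit_hash(x):
    """Map an element to a deterministic pseudo-random value in [0, 1)."""
    digest = hashlib.sha256(repr(x).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def sample(s, p):
    """Keep each element whose hash falls below p (assumed scheme).

    Because the hash depends only on the element, two sites sampling
    different sets keep or drop any shared element consistently.
    """
    return {x for x in s if unit_hash(x) < p}


def estimate_intersection(sample_a, sample_b, p):
    """Scale the sample intersection size back up by the sampling rate."""
    return len(sample_a & sample_b) / p


def jaccard(inter, size_a, size_b):
    """Resemblance |A ∩ B| / |A ∪ B|, derived from the intersection size."""
    union = size_a + size_b - inter
    return inter / union if union else 1.0


def containment(inter, size_a):
    """Subset containment |A ∩ B| / |A|, derived from the intersection size."""
    return inter / size_a if size_a else 1.0


if __name__ == "__main__":
    A = set(range(1000))
    B = set(range(500, 1500))
    p = 0.1  # sampling rate; small values give small, noisy samples
    est = estimate_intersection(sample(A, p), sample(B, p), p)
    print("exact intersection:", len(A & B))
    print("estimated intersection:", est)
    print("estimated Jaccard:", jaccard(est, len(A), len(B)))
```

With a small p, only a handful of shared elements survive in both samples, so the scaled-up estimate can vary widely. That is the inaccuracy for small samples that the abstract describes.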