Estimating set intersection using small samples

Authors:
Henning Köhler
Affiliations:
The University of Queensland, Brisbane, Australia
Venue:
ACSC '10 Proceedings of the Thirty-Third Australasian Conference on Computer Science - Volume 102
Year:
2010

Abstract

The similarity of two sets A and B can be measured by the size of their intersection A ∩ B relative to the sizes of A and B. The classic measure is resemblance, or Jaccard similarity, but other useful measures (e.g. subset containment) can be derived from intersection size as well. For large sets, or for many sets, computing intersection size exactly can be expensive, and it requires transmitting entire sets when they are distributed. For this reason, a number of sampling techniques have been developed that allow us to estimate intersection size (and derived measures) efficiently from the intersection size of smaller sample sets. However, while existing estimation formulas are intuitive and unbiased, they can be quite inaccurate when samples are small. We show that by using more advanced estimation techniques, we can significantly reduce sample sizes without compromising accuracy or, conversely, obtain more accurate results from the same samples.
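
As a rough illustration of the setting the abstract describes, the sketch below computes resemblance and containment from an exact intersection size, then estimates intersection size from small samples. It assumes one common scheme, hash-threshold ("coordinated") sampling with the simple scale-up estimator; this is not necessarily the sampling technique or the improved estimator studied in the paper. All function names and the sampling rate `p` are hypothetical.

```python
import hashlib


def jaccard(a, b):
    """Resemblance |A ∩ B| / |A ∪ B|, computed from the intersection size."""
    inter = len(a & b)
    union = len(a) + len(b) - inter
    return inter / union if union else 1.0


def containment(a, b):
    """Fraction of A that is contained in B: |A ∩ B| / |A|."""
    return len(a & b) / len(a) if a else 1.0


def _hash64(x):
    """Deterministic 64-bit hash of an element (assumes repr(x) is stable)."""
    digest = hashlib.sha256(repr(x).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sample(s, p):
    """Keep the elements whose hash falls below the threshold p * 2**64.

    Every set is sampled with the same hash function, so a shared element
    is either kept in all samples or dropped from all of them.
    """
    threshold = p * 2**64
    return {x for x in s if _hash64(x) < threshold}


def estimate_intersection(sample_a, sample_b, p):
    """Simple scale-up estimate: |S_A ∩ S_B| / p (assumed, not the paper's)."""
    return len(sample_a & sample_b) / p


if __name__ == "__main__":
    A = set(range(0, 6000))
    B = set(range(4000, 10000))      # true intersection size is 2000
    p = 0.05                         # hypothetical sampling rate
    sa, sb = sample(A, p), sample(B, p)
    print("exact intersection:", len(A & B))
    print("estimated intersection:", estimate_intersection(sa, sb, p))
    print("exact Jaccard:", jaccard(A, B))
    print("sample Jaccard:", jaccard(sa, sb))
```

Only the samples need to be exchanged between sites, which is the cost saving the abstract refers to. With very small `p` the scale-up estimate fluctuates strongly, which is the inaccuracy on small samples that motivates better estimators.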