The similarity of two sets A and B can be measured by the size of their intersection A ∩ B relative to the sizes of A and B. The classic measure here is resemblance, or Jaccard similarity, but other useful measures (e.g., subset containment) can be derived from intersection size as well. For large or numerous sets, however, computing intersection size exactly can be expensive, and it requires transmitting entire sets when they are distributed. For this reason, a number of sampling techniques have been developed that allow us to estimate intersection size (and derived measures) efficiently from the intersection size of smaller sample sets. However, while existing estimation formulas are intuitive and unbiased, they can be quite inaccurate when samples are small. We show that by using more advanced estimation techniques, we can significantly reduce sample sizes without compromising accuracy or, conversely, obtain more accurate results from the same samples.
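To make the quantities above concrete, the following minimal sketch computes resemblance and containment from an intersection size, and estimates that intersection size from small samples. It is an illustration under stated assumptions, not the estimator proposed in this work: it assumes a coordinated sampling scheme in which both sets are sampled with the same hash function and the same rate `p`, and it uses the simple scale-up formula that divides the sample intersection size by `p`. The names `unit_hash`, `coordinated_sample`, `estimate_intersection`, `jaccard`, and `containment` are hypothetical.

```python
import hashlib


def unit_hash(item, seed=0):
    """Map an item to a pseudo-random float in [0, 1) with a shared hash."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    # Keep the top 53 bits so the quotient is exactly representable and < 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def coordinated_sample(items, p, seed=0):
    """Keep each item whose hash falls below p (assumed sampling scheme).

    Because the hash is shared, an item of A ∩ B is either in both
    samples or in neither.
    """
    return {x for x in items if unit_hash(x, seed) < p}


def estimate_intersection(sample_a, sample_b, p):
    """Simple scale-up estimate of |A ∩ B| from two coordinated samples."""
    return len(sample_a & sample_b) / p


def jaccard(inter, size_a, size_b):
    """Resemblance |A ∩ B| / |A ∪ B|, derived from the intersection size."""
    union = size_a + size_b - inter
    # Convention chosen here: two empty sets have resemblance 1.0.
    return inter / union if union else 1.0


def containment(inter, size_a):
    """Containment of A in B: |A ∩ B| / |A|."""
    # Convention chosen here: the empty set is fully contained.
    return inter / size_a if size_a else 1.0


if __name__ == "__main__":
    A = set(range(0, 1000))
    B = set(range(500, 1500))
    p = 0.1  # assumed sampling rate

    exact = len(A & B)
    est = estimate_intersection(
        coordinated_sample(A, p), coordinated_sample(B, p), p
    )
    print("exact intersection:", exact)
    print("estimated intersection:", est)
    print("estimated resemblance:", jaccard(est, len(A), len(B)))
    print("estimated containment of A in B:", containment(est, len(A)))
```

Under this scheme each element of A ∩ B lands in the sample intersection with probability `p`, so the scale-up estimate is unbiased, but with a small `p` only a handful of elements decide the result, which is the small-sample inaccuracy described above.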