Bottom-k and priority sampling, set similarity and subset sums with minimal independence

Authors:
Mikkel Thorup
Affiliations:
AT&T Labs--Research and University of Copenhagen, Florham Park, NJ, USA
Venue:
Proceedings of the forty-fifth annual ACM symposium on Theory of computing
Year:
2013

Abstract

We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting of the k elements that are smallest according to a given hash function h. With this sample we can estimate the relative size f = |Y|/|X| of any subset Y as |S_k(X) ∩ Y|/k. A standard application is the estimation of the Jaccard similarity f = |A ∩ B|/|A ∪ B| between sets A and B. Given the bottom-k samples from A and B, we construct the bottom-k sample of their union as S_k(A ∪ B) = S_k(S_k(A) ∪ S_k(B)), and then the similarity is estimated as |S_k(A ∪ B) ∩ S_k(A) ∩ S_k(B)|/k. We show here that even if the hash function is only 2-independent, the expected relative error is O(1/√(fk)). For fk = Ω(1) this is within a constant factor of the expected relative error with truly random hashing. For comparison, consider the classic approach of k×min-wise, where we use k independent hash functions h_1, ..., h_k, storing the smallest element with each hash function. For k×min-wise there is an at least constant bias with constant independence, and it is not reduced with larger k. Recently Feigenblat et al. showed that bottom-k circumvents the bias if the hash function is 8-independent and k is sufficiently large. We get down to 2-independence for any k. Our result is based on a simple union bound, transferring generic concentration bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger probability error bounds with higher independence. For weighted sets, we consider priority sampling, which adapts efficiently to the concrete input weights, e.g., benefiting strongly from heavy-tailed input. This time, the analysis is much more involved, but again we show that generic concentration bounds can be applied.
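
The bottom-k similarity estimate described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's code. It assumes integer keys smaller than the prime 2^61 - 1, uses the standard linear hash h(x) = (a*x + b) mod p as one example of a 2-independent family, and assumes every set involved has at least k elements. The names make_hash, bottom_k, and estimate_jaccard are hypothetical and chosen only for this illustration.

```python
import heapq
import random

# Assumption: all keys are integers in [0, PRIME). PRIME is the Mersenne prime 2^61 - 1.
PRIME = (1 << 61) - 1


def make_hash(rng):
    """Draw h(x) = (a*x + b) mod PRIME with a, b uniform in [0, PRIME).

    This linear family is a standard example of 2-independent hashing.
    """
    a = rng.randrange(PRIME)
    b = rng.randrange(PRIME)
    return lambda x: (a * x + b) % PRIME


def bottom_k(keys, h, k):
    """Return the k keys with the smallest hash values, in increasing hash order."""
    return heapq.nsmallest(k, set(keys), key=h)


def estimate_jaccard(sample_a, sample_b, h, k):
    """Estimate |A n B| / |A u B| from the bottom-k samples of A and B alone.

    Follows the abstract: S_k(A u B) = S_k(S_k(A) u S_k(B)), and the estimate is
    |S_k(A u B) n S_k(A) n S_k(B)| / k. Assumes |A u B| >= k.
    """
    set_a, set_b = set(sample_a), set(sample_b)
    union_sample = bottom_k(set_a | set_b, h, k)
    hits = sum(1 for x in union_sample if x in set_a and x in set_b)
    return hits / k


if __name__ == "__main__":
    rng = random.Random(2013)  # arbitrary seed, only to make the demo repeatable
    h = make_hash(rng)
    k = 500
    A = set(range(0, 3000))
    B = set(range(1000, 4000))  # true Jaccard similarity is 2000/4000 = 0.5
    est = estimate_jaccard(bottom_k(A, h, k), bottom_k(B, h, k), h, k)
    print("estimated similarity:", est)
```

Only the two size-k samples are needed at estimation time, so each set can be summarized once and compared against many others. The same hash function h must be used for every set being compared.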