We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting of the k elements that are smallest according to a given hash function h. With this sample we can estimate the relative size f = |Y|/|X| of any subset Y as |S_k(X) ∩ Y|/k. A standard application is the estimation of the Jaccard similarity f = |A ∩ B|/|A ∪ B| between sets A and B. Given the bottom-k samples from A and B, we construct the bottom-k sample of their union as S_k(A ∪ B) = S_k(S_k(A) ∪ S_k(B)), and then the similarity is estimated as |S_k(A ∪ B) ∩ S_k(A) ∩ S_k(B)|/k. We show here that even if the hash function is only 2-independent, the expected relative error is O(1/√(fk)). For fk = Ω(1) this is within a constant factor of the expected relative error with truly random hashing. For comparison, consider the classic approach of k×min-wise, where we use k independent hash functions h_1, ..., h_k, storing the smallest element with each hash function. For k×min-wise there is an at least constant bias with constant independence, and it is not reduced with larger k. Recently Feigenblat et al. showed that bottom-k circumvents the bias if the hash function is 8-independent and k is sufficiently large. We get down to 2-independence for any k. Our result is based on a simple union bound, transferring generic concentration bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger probability error bounds with higher independence. For weighted sets, we consider priority sampling, which adapts efficiently to the concrete input weights, e.g., benefiting strongly from heavy-tailed input. This time, the analysis is much more involved, but again we show that generic concentration bounds can be applied.
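To make the similarity estimator concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes keys are non-negative integers below a fixed Mersenne prime P, uses the textbook linear family h(x) = (a·x + b) mod P as the 2-independent hash function, and assumes each set has at least k elements. The names `make_hash`, `bottom_k`, and `estimate_jaccard` are hypothetical and chosen only for illustration.

```python
import random

# Assumption: all keys are integers in [0, P). P = 2^61 - 1 is a Mersenne prime.
P = (1 << 61) - 1


def make_hash(rng):
    """Draw h(x) = (a*x + b) mod P with a, b uniform in [0, P).

    This is the standard 2-independent (pairwise independent) family over Z_P.
    """
    a = rng.randrange(P)
    b = rng.randrange(P)
    return lambda x: (a * x + b) % P


def bottom_k(items, h, k):
    """Return the k distinct items with the smallest hash values.

    Ties in the hash value are broken by the key itself, so the order is total.
    """
    return set(sorted(set(items), key=lambda x: (h(x), x))[:k])


def estimate_jaccard(sketch_a, sketch_b, h, k):
    """Estimate |A ∩ B| / |A ∪ B| from the bottom-k sketches of A and B."""
    # Bottom-k sample of the union, built from the two sketches alone.
    union_sketch = bottom_k(sketch_a | sketch_b, h, k)
    # Fraction of the union sample that lies in both sets.
    return len(union_sketch & sketch_a & sketch_b) / k


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed only to make the demo repeatable
    h = make_hash(rng)
    k = 500
    A = set(range(0, 3000))
    B = set(range(1000, 4000))  # true Jaccard similarity is 2000/4000 = 0.5
    est = estimate_jaccard(bottom_k(A, h, k), bottom_k(B, h, k), h, k)
    print("estimated similarity:", est)
```

Both sketches must be built with the same hash function h; only the two size-k sketches, never the full sets, are needed at estimation time.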