Exponential time improvement for min-wise based algorithms

Authors:
Guy Feigenblat;Ely Porat;Ariel Shiftan
Affiliations:
Bar-Ilan University, Ramat Gan, Israel;Bar-Ilan University, Ramat Gan, Israel;Bar-Ilan University, Ramat Gan, Israel
Venue:
Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms
Year:
2011

Citing 28
Cited 1

Size-estimation framework with applications to transitive closure and reachability

Journal of Computer and System Sciences
Min-wise independent permutations (extended abstract)

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
A small approximately min-wise independent family of hash functions

Proceedings of the tenth annual ACM-SIAM symposium on Discrete algorithms
Estimating simple functions on the union of data streams

Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Evaluating strategies for similarity search on the web

Proceedings of the 11th international conference on World Wide Web
The Design and Analysis of Computer Algorithms

The Design and Analysis of Computer Algorithms
A Derandomization Using Min-Wise Independent Permutations

RANDOM '98 Proceedings of the Second International Workshop on Randomization and Approximation Techniques in Computer Science
Low Discrepancy Sets Yield Approximate Min-Wise Independent Permutation Families

RANDOM-APPROX '99 Proceedings of the Third International Workshop on Approximation Algorithms for Combinatorial Optimization Problems: Randomization, Approximation, and Combinatorial Algorithms and Techniques
Counting Distinct Elements in a Data Stream

RANDOM '02 Proceedings of the 6th International Workshop on Randomization and Approximation Techniques
Identifying and Filtering Near-Duplicate Documents

CPM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching
Estimating Rarity and Similarity over Data Stream Windows

ESA '02 Proceedings of the 10th Annual European Symposium on Algorithms
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Processing set expressions over continuous update streams

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
What's new: finding significant differences in network data streams

IEEE/ACM Transactions on Networking (TON)
Finding near-duplicate web pages: a large-scale evaluation of algorithms

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Near-duplicate detection by instance-level constrained clustering

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Detecting near-duplicates for web crawling

Proceedings of the 16th international conference on World Wide Web
Google news personalization: scalable online collaborative filtering

Proceedings of the 16th international conference on World Wide Web
Summarizing data using bottom-k sketches

Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
A near-optimal algorithm for computing the entropy of a stream

SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Randomized geometric algorithms and pseudo-random generators

SFCS '92 Proceedings of the 33rd Annual Symposium on Foundations of Computer Science
Tighter estimation using bottom k sketches

Proceedings of the VLDB Endowment
Greedy List Intersection

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Sketching Algorithms for Approximating Rank Correlations in Collaborative Filtering Systems

SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Sketching techniques for collaborative filtering

IJCAI'09 Proceedings of the 21st international joint conference on Artificial intelligence
An optimal algorithm for the distinct elements problem

Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
On the k-independence required by linear probing and minwise independence

ICALP'10 Proceedings of the 37th international colloquium conference on Automata, languages and programming

Sketching for big data recommender systems using fast pseudo-random fingerprints

ICALP'13 Proceedings of the 40th international conference on Automata, Languages, and Programming - Volume Part II


Abstract

In this paper we extend the notion of a min-wise independent family of hash functions by defining a k-min-wise independent family of hash functions. Informally, under this definition, all subsets of size k of any fixed set X have an equal chance of receiving the minimal hash values among all the elements of X, where the probability is over the random choice of a hash function from the family. This property measures the randomness of the family, since a truly random function satisfies the definition for k = |X|. We define approximately k-min-wise independent families of hash functions and give a time- and space-efficient construction of such a family by extending Indyk's construction of approximately min-wise independent families [1]. The number of words needed to represent each function is O(k log log(1/ε) + log(1/ε)), which is suboptimal only by a factor of O(log log(1/ε)), where ε ∈ (0, 1) is the desired error bound. This construction is the first that is applicable to sampling bottom-k sketches [2, 3] out of the universe. In addition, we introduce a general and novel technique that utilizes our construction and can be used to improve many min-wise based algorithms, such as [4, 5, 6, 7, 3, 2, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. As an example, we show how to apply it to similarity estimation over data streams, exponentially reducing the running time of the currently known result [5]. We also discuss improvements to known algorithms for estimating rarity and entropy of a random walk over graphs (from SODA '07 [20]).
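
To make the bottom-k sketch setting mentioned in the abstract concrete, the following is a minimal Python sketch of similarity (Jaccard) estimation from two bottom-k sketches. It is an illustration only and not the construction from the paper: the hash function `h` is a stand-in built on BLAKE2b in place of an approximately k-min-wise independent family, and the names `h`, `bottom_k`, and `estimate_jaccard` are hypothetical helpers introduced here.

```python
import hashlib


def h(x, seed=0):
    """Stand-in hash to 64-bit integers (assumption: NOT the paper's family)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(repr(x).encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def bottom_k(items, k, seed=0):
    """Return the k smallest distinct hash values of the items, sorted ascending."""
    return sorted({h(x, seed) for x in items})[:k]


def estimate_jaccard(sketch_a, sketch_b, k):
    """Estimate |A ∩ B| / |A ∪ B| from the bottom-k sketches of A and B."""
    a, b = set(sketch_a), set(sketch_b)
    # The k smallest values of the merged sketches form the bottom-k sketch of A ∪ B.
    union_k = sorted(a | b)[:k]
    if not union_k:
        return 0.0
    # A value present in both sketches belongs to an element of A ∩ B.
    hits = sum(1 for v in union_k if v in a and v in b)
    return hits / len(union_k)


if __name__ == "__main__":
    k = 200
    A = range(0, 1000)
    B = range(500, 1500)  # true Jaccard similarity is 500 / 1500 = 1/3
    est = estimate_jaccard(bottom_k(A, k), bottom_k(B, k), k)
    print(f"estimated similarity: {est:.3f}")
```

Both sketches must be built with the same hash function (here, the same `seed`) for the comparison to be meaningful. The accuracy of the estimate depends on how close the hash family is to k-min-wise independent, which is the property the paper's construction is designed to provide with a small representation.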