Exponential time improvement for min-wise based algorithms

Authors:
Guy Feigenblat;Ely Porat;Ariel Shiftan
Affiliations:
Department of Computer Science, Bar-Ilan University, Ramat Gan 52900, Israel;Department of Computer Science, Bar-Ilan University, Ramat Gan 52900, Israel;Department of Computer Science, Bar-Ilan University, Ramat Gan 52900, Israel
Venue:
Information and Computation
Year:
2011

Citing 29
Cited 3

Size-estimation framework with applications to transitive closure and reachability

Journal of Computer and System Sciences
Min-wise independent permutations (extended abstract)

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
A small approximately min-wise independent family of hash functions

Journal of Algorithms
Estimating simple functions on the union of data streams

Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Evaluating strategies for similarity search on the web

Proceedings of the 11th international conference on World Wide Web
The Design and Analysis of Computer Algorithms

The Design and Analysis of Computer Algorithms
A Derandomization Using Min-Wise Independent Permutations

RANDOM '98 Proceedings of the Second International Workshop on Randomization and Approximation Techniques in Computer Science
Counting Distinct Elements in a Data Stream

RANDOM '02 Proceedings of the 6th International Workshop on Randomization and Approximation Techniques
Identifying and Filtering Near-Duplicate Documents

COM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching
Estimating Rarity and Similarity over Data Stream Windows

ESA '02 Proceedings of the 10th Annual European Symposium on Algorithms
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Finding Interesting Associations without Support Pruning

ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Processing set expressions over continuous update streams

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
What's new: finding significant differences in network data streams

IEEE/ACM Transactions on Networking (TON)
Finding near-duplicate web pages: a large-scale evaluation of algorithms

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Near-duplicate detection by instance-level constrained clustering

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Detecting near-duplicates for web crawling

Proceedings of the 16th international conference on World Wide Web
Google news personalization: scalable online collaborative filtering

Proceedings of the 16th international conference on World Wide Web
Summarizing data using bottom-k sketches

Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
A near-optimal algorithm for computing the entropy of a stream

SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Randomized geometric algorithms and pseudo-random generators

SFCS '92 Proceedings of the 33rd Annual Symposium on Foundations of Computer Science
Tighter estimation using bottom k sketches

Proceedings of the VLDB Endowment
Greedy List Intersection

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Sketching Algorithms for Approximating Rank Correlations in Collaborative Filtering Systems

SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Sketching techniques for collaborative filtering

IJCAI'09 Proceedings of the 21st international joint conference on Artificial intelligence
An optimal algorithm for the distinct elements problem

Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
On the k-independence required by linear probing and minwise independence

ICALP'10 Proceedings of the 37th international colloquium conference on Automata, languages and programming
Fingerprinting ratings for collaborative filtering: theoretical and empirical analysis

SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval

On the streaming complexity of computing local clustering coefficients

Proceedings of the sixth ACM international conference on Web search and data mining
STRIP: stream learning of influence probabilities

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Homomorphic fingerprints under misalignments: sketching edit and shift distances

Proceedings of the forty-fifth annual ACM symposium on Theory of computing

Abstract

In this paper we extend the notion of a min-wise independent family of hash functions by defining a k-min-wise independent family of hash functions. Informally, under this definition, all subsets of size k of any fixed set X have an equal chance of having the minimal hash values among all the elements of X, where the probability is over the random choice of a hash function from the family. This property measures the randomness of the family, since a truly random function satisfies the definition for k=|X|. We define approximately k-min-wise independent families of hash functions and give a time- and space-efficient construction of such a family by extending Indyk's construction of approximately min-wise independent families. The number of words needed to represent each function is O(k log log(1/ε) + log(1/ε)), which is suboptimal only by a factor of O(log log(1/ε)), where ε ∈ (0,1) is the desired error bound. This construction is the first applicable for sampling bottom-k sketches out of the universe. In addition, we introduce a general and novel technique that utilizes our construction and can be used to improve many min-wise based algorithms. As an example, we show how to apply it to similarity estimation over data streams, exponentially reducing the running time of the best currently known result [5]. We also discuss improvements to known algorithms for estimating rarity and the entropy of a random walk over graphs.
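
To make the bottom-k sketch idea mentioned in the abstract concrete, the following is a minimal illustrative sketch in Python of similarity estimation from bottom-k sketches. It is not the construction of the paper: the function names (`hash_value`, `bottom_k`, `estimate_jaccard`) are hypothetical, and a salted BLAKE2b hash is assumed as a stand-in for a hash function drawn from an approximately k-min-wise independent family.

```python
import hashlib


def hash_value(item, seed):
    # Assumption: a salted cryptographic hash stands in for a function
    # drawn from an (approximately) k-min-wise independent family.
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def bottom_k(items, k, seed=0):
    # The bottom-k sketch keeps the k smallest hash values of the set.
    return sorted({hash_value(x, seed) for x in items})[:k]


def estimate_jaccard(sketch_a, sketch_b, k):
    # The k smallest values of the merged sketches form the bottom-k
    # sketch of the union; the fraction of them present in both
    # sketches estimates |A ∩ B| / |A ∪ B|.
    a, b = set(sketch_a), set(sketch_b)
    union_k = sorted(a | b)[:k]
    if not union_k:
        return 0.0
    return sum(1 for v in union_k if v in a and v in b) / len(union_k)


if __name__ == "__main__":
    k = 256
    set_a = range(0, 2000)
    set_b = range(1000, 3000)  # true Jaccard similarity is 1/3
    est = estimate_jaccard(bottom_k(set_a, k), bottom_k(set_b, k), k)
    print(f"estimated similarity: {est:.3f}")
```

Both sketches must be built with the same hash function (the same `seed` here), since the estimate relies on a shared element receiving the same hash value in both sets. The accuracy improves as k grows, at the cost of a larger sketch.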