Size-estimation framework with applications to transitive closure and reachability
Journal of Computer and System Sciences
Min-wise independent permutations (extended abstract)
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Syntactic clustering of the Web
Selected papers from the sixth international conference on World Wide Web
A small approximately min-wise independent family of hash functions
Journal of Algorithms
Estimating simple functions on the union of data streams
Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Evaluating strategies for similarity search on the web
Proceedings of the 11th international conference on World Wide Web
The Design and Analysis of Computer Algorithms
The Design and Analysis of Computer Algorithms
A Derandomization Using Min-Wise Independent Permutations
RANDOM '98 Proceedings of the Second International Workshop on Randomization and Approximation Techniques in Computer Science
Counting Distinct Elements in a Data Stream
RANDOM '02 Proceedings of the 6th International Workshop on Randomization and Approximation Techniques
Identifying and Filtering Near-Duplicate Documents
CPM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching
Estimating Rarity and Similarity over Data Stream Windows
ESA '02 Proceedings of the 10th Annual European Symposium on Algorithms
On the Resemblance and Containment of Documents
SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Finding Interesting Associations without Support Pruning
ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Processing set expressions over continuous update streams
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
What's new: finding significant differences in network data streams
IEEE/ACM Transactions on Networking (TON)
Finding near-duplicate web pages: a large-scale evaluation of algorithms
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Near-duplicate detection by instance-level constrained clustering
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Detecting near-duplicates for web crawling
Proceedings of the 16th international conference on World Wide Web
Google news personalization: scalable online collaborative filtering
Proceedings of the 16th international conference on World Wide Web
Summarizing data using bottom-k sketches
Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
A near-optimal algorithm for computing the entropy of a stream
SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Randomized geometric algorithms and pseudo-random generators
SFCS '92 Proceedings of the 33rd Annual Symposium on Foundations of Computer Science
Tighter estimation using bottom k sketches
Proceedings of the VLDB Endowment
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Sketching Algorithms for Approximating Rank Correlations in Collaborative Filtering Systems
SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
Sketching techniques for collaborative filtering
IJCAI'09 Proceedings of the 21st international joint conference on Artificial intelligence
An optimal algorithm for the distinct elements problem
Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
On the k-independence required by linear probing and minwise independence
ICALP'10 Proceedings of the 37th international colloquium conference on Automata, languages and programming
Fingerprinting ratings for collaborative filtering: theoretical and empirical analysis
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
On the streaming complexity of computing local clustering coefficients
Proceedings of the sixth ACM international conference on Web search and data mining
STRIP: stream learning of influence probabilities
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Homomorphic fingerprints under misalignments: sketching edit and shift distances
Proceedings of the forty-fifth annual ACM symposium on Theory of computing
In this paper we extend the notion of a min-wise independent family of hash functions by defining a k-min-wise independent family of hash functions. Informally, under this definition, every subset of size k of any fixed set X has an equal chance of receiving the minimal hash values among all the elements of X, where the probability is over the random choice of a hash function from the family. This property measures the randomness of the family, since a truly random function satisfies the definition for k=|X|. We define approximately k-min-wise independent families of hash functions and give a time- and space-efficient construction of them by extending Indyk's construction of approximately min-wise independent families. The number of words needed to represent each function is $O(k \log\log(1/\epsilon) + \log(1/\epsilon))$, which is suboptimal only by a factor of $O(\log\log(1/\epsilon))$, where $\epsilon \in (0,1)$ is the desired error bound. This construction is the first that can be applied to sampling bottom-k sketches out of the universe. In addition, we introduce a general and novel technique that uses our construction and can improve many min-wise based algorithms. As an example, we show how to apply it to similarity estimation over data streams, exponentially reducing the running time of the currently known result [5]. We also discuss improvements to known algorithms for estimating rarity and the entropy of random walks over graphs.
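To make the bottom-k idea concrete, the following is a minimal Python sketch of similarity (Jaccard) estimation from bottom-k sketches. It is an illustration only, not the construction from the paper: the function names `h`, `bottom_k`, and `estimate_jaccard` are hypothetical, and a salted BLAKE2b hash is assumed as a stand-in for a function drawn from an (approximately) k-min-wise independent family.

```python
import hashlib


def h(x, seed=0):
    """Stand-in hash (assumption): 64-bit value from salted BLAKE2b.

    This is NOT the k-min-wise family constructed in the paper; it only
    plays the role of "a hash function chosen at random from a family".
    """
    digest = hashlib.blake2b(
        repr(x).encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def bottom_k(items, k, seed=0):
    """Return the k smallest hash values over the distinct items, sorted."""
    return sorted({h(x, seed) for x in items})[:k]


def estimate_jaccard(sketch_a, sketch_b, k):
    """Estimate |A ∩ B| / |A ∪ B| from the bottom-k sketches of A and B."""
    a, b = set(sketch_a), set(sketch_b)
    # The k smallest values of the merged sketches form the bottom-k
    # sketch of the union A ∪ B.
    union_bottom = sorted(a | b)[:k]
    if not union_bottom:
        return 0.0
    # A value in the union's bottom-k that appears in both sketches
    # corresponds to an element of the intersection.
    shared = sum(1 for v in union_bottom if v in a and v in b)
    return shared / len(union_bottom)


if __name__ == "__main__":
    A = range(0, 1000)
    B = range(500, 1500)  # true Jaccard similarity is 500 / 1500
    k = 128
    print(estimate_jaccard(bottom_k(A, k), bottom_k(B, k), k))
```

The estimate relies on each size-k subset of the union being (nearly) equally likely to receive the smallest hash values, which is the property the k-min-wise definition formalizes. When k is at least the size of the union, the sketches hold every hash value and the estimate is exact, barring hash collisions.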