b-bit minwise hashing in practice

Authors:
Ping Li;Anshumali Shrivastava;Arnd Christian König
Affiliations:
Rutgers University, Piscataway, NJ;Cornell University, Ithaca, NY;Microsoft Corporation, Redmond, WA
Venue:
Proceedings of the 5th Asia-Pacific Symposium on Internetware
Year:
2013

Citing 27

A reliable randomized algorithm for the closest-pair problem. Journal of Algorithms
Min-wise independent permutations (extended abstract). STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Approximate nearest neighbors: towards removing the curse of dimensionality. STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Syntactic clustering of the Web. Selected papers from the sixth international conference on World Wide Web
A small approximately min-wise independent family of hash functions. Journal of Algorithms
Universal Hashing and k-Wise Independent Random Variables via Integer Arithmetic without Primes. STACS '96 Proceedings of the 13th Annual Symposium on Theoretical Aspects of Computer Science
A large-scale study of the evolution of web pages. WWW '03 Proceedings of the 12th international conference on World Wide Web
Universal classes of hash functions (Extended Abstract). STOC '77 Proceedings of the ninth annual ACM symposium on Theory of computing
On the Resemblance and Containment of Documents. SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
Training linear SVMs in linear time. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
Using sketches to estimate associations. HLT '05 Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Detecting near-duplicates for web crawling. Proceedings of the 16th international conference on World Wide Web
Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. Proceedings of the 24th international conference on Machine learning
Tracking Web spam with HTML style similarities. ACM Transactions on the Web (TWEB)
LIBLINEAR: A Library for Large Linear Classification. The Journal of Machine Learning Research
Efficient detection of large-scale redundancy in enterprise file systems. ACM SIGOPS Operating Systems Review
Nearest-neighbor caching for content-match applications. Proceedings of the 18th international conference on World wide web
Feature hashing for large scale multitask learning. ICML '09 Proceedings of the 26th Annual International Conference on Machine Learning
Applying syntactic similarity algorithms for enterprise information management. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Relational query coprocessing on graphics processors. ACM Transactions on Database Systems (TODS)
Precomputing search features for fast and accurate query classification. Proceedings of the third ACM international conference on Web search and data mining
Hash Kernels for Structured Data. The Journal of Machine Learning Research
b-Bit minwise hashing. Proceedings of the 19th international conference on World wide web
FAST: fast architecture sensitive tree search on modern CPUs and GPUs. Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
On the k-independence required by linear probing and minwise independence. ICALP'10 Proceedings of the 37th international colloquium conference on Automata, languages and programming
Selective block minimization for faster convergence of limited memory large-scale linear models. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
Fast near neighbor search in high-dimensional binary data. ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I

Abstract

Minwise hashing is a standard technique in the context of search for approximating set similarities. Recent work [26, 32] demonstrated a potential use of b-bit minwise hashing [23, 24] for efficient search and learning on massive, high-dimensional, binary data (which are typical for many applications in Web search and text mining). In this paper, we focus on a number of critical issues that must be addressed before one can apply b-bit minwise hashing to the volumes of data often used in industrial applications. Minwise hashing requires an expensive preprocessing step that, for each data vector, applies k (e.g., 500) permutations and computes the minimal value under each. We developed a parallelization scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20 to 80, making it substantially smaller than the data loading time. Reducing the preprocessing time is highly beneficial in practice, e.g., for duplicate Web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers. Another critical issue is that for very large data sets it becomes impossible to store a (fully) random permutation matrix, due to its space requirements. Our paper is the first study to demonstrate that b-bit minwise hashing implemented using simple hash functions, e.g., the 2-universal (2U) and 4-universal (4U) hash families, can produce learning results very similar to those obtained with fully random permutations. Experiments on datasets of up to 200GB are presented.
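
As a rough illustration of the preprocessing step described in the abstract, the following minimal sketch computes b-bit minwise signatures using 2-universal hash functions in place of stored permutations. It is not the paper's implementation: the function names, the choice of the prime 2^61 - 1, and the use of Python's `random` module are assumptions made here for illustration. The resemblance estimator also assumes very sparse data, in which case the probability that the lowest b bits match is approximately 2^-b + (1 - 2^-b) R.

```python
import random

# Assumption for this sketch: a Mersenne prime larger than any element id.
PRIME = (1 << 61) - 1


def make_2u_hashes(k, seed=0):
    """Draw k hash functions h(x) = (a*x + c) mod PRIME, each stored as (a, c)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def bbit_signature(items, hashes, b):
    """For each hash function, keep only the lowest b bits of the minimum hash value."""
    mask = (1 << b) - 1
    return [min((a * x + c) % PRIME for x in items) & mask for a, c in hashes]


def estimate_resemblance(sig1, sig2, b):
    """Estimate the resemblance R from the fraction of matching b-bit values.

    Uses the sparse-data approximation P(match) = 2^-b + (1 - 2^-b) * R,
    which is an assumption of this sketch rather than the general formula.
    """
    p_hat = sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)
    c = 1.0 / (1 << b)
    return (p_hat - c) / (1.0 - c)
```

Only the two integers (a, c) are stored per hash function, instead of a full permutation of the feature space, which is the storage saving the abstract refers to. Each of the k minima is computed independently of the others, which is what makes this step a natural candidate for parallel hardware such as GPUs.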