Minwise hashing is a standard technique in the context of search for approximating set similarities. Recent work [26, 32] demonstrated a potential use of b-bit minwise hashing [23, 24] for efficient search and learning on massive, high-dimensional, binary data (which are typical of many applications in Web search and text mining). In this paper, we focus on a number of critical issues that must be addressed before one can apply b-bit minwise hashing to the volumes of data often used in industrial applications. Minwise hashing requires an expensive preprocessing step that, for each data vector, applies k (e.g., 500) permutations and records the minimal value under each. We developed a parallelization scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20 to 80, becoming substantially smaller than the data loading time. Reducing the preprocessing time is highly beneficial in practice, e.g., for duplicate Web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers. Another critical issue is that for very large data sets it becomes impossible to store a (fully) random permutation matrix because of its space requirements. Our paper is the first study to demonstrate that b-bit minwise hashing implemented using simple hash functions, e.g., the 2-universal (2U) and 4-universal (4U) hash families, can produce learning results very similar to those obtained with fully random permutations. Experiments on datasets of up to 200GB are presented.
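To make the preprocessing step concrete, the following is a minimal Python sketch of b-bit minwise hashing in which each permutation is replaced by a simple linear hash. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function names (`make_2u_hashes`, `minwise_signature`, `b_bit`, `match_fraction`) are hypothetical, and the hash form `(a*x + b) mod P` with a Mersenne prime `P` is one common 2-universal construction chosen here for simplicity.

```python
import random

# Assumption: a Mersenne prime larger than any feature index in the data.
P = (1 << 61) - 1


def make_2u_hashes(k, seed=0):
    """Draw k (a, b) coefficient pairs for hashes h(x) = (a*x + b) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]


def minwise_signature(indices, hashes):
    """For each hash, keep the minimum hashed value over the set's indices.

    `indices` is the non-empty set of nonzero positions of a binary vector.
    """
    return [min((a * x + b) % P for x in indices) for a, b in hashes]


def b_bit(signature, b):
    """Keep only the lowest b bits of each minimum."""
    mask = (1 << b) - 1
    return [v & mask for v in signature]


def match_fraction(sig1, sig2):
    """Fraction of positions at which two signatures agree."""
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)


if __name__ == "__main__":
    hashes = make_2u_hashes(k=500, seed=42)
    s1 = minwise_signature(set(range(0, 1000)), hashes)
    s2 = minwise_signature(set(range(500, 1500)), hashes)
    print("full-minimum match fraction:", match_fraction(s1, s2))
    print("2-bit match fraction:", match_fraction(b_bit(s1, 2), b_bit(s2, 2)))
```

Only the k coefficient pairs need to be stored, rather than a permutation matrix over the full feature space, which is the space saving the abstract refers to. The match fraction of the full minima serves as an estimate of set resemblance; the b-bit match fraction is never lower, because truncated values can also agree by chance, so it needs a correction of the kind described in the b-bit minwise hashing work [23, 24] before it is read as a similarity.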