b-Bit minwise hashing

Authors:
Ping Li;Christian König
Affiliations:
Cornell University, Ithaca, NY, USA;Microsoft Research, Redmond, WA, USA
Venue:
Proceedings of the 19th international conference on World wide web
Year:
2010

Abstract

This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance), with applications in information retrieval, data management, computational advertising, etc. By storing only the lowest b bits of each hashed value (e.g., b=1 or 2), we gain substantial advantages in terms of storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any b. We demonstrate that, even in the least favorable scenario, using b=1 may reduce the storage space by at least a factor of 21.3 (or 10.7) compared to b=64 (or b=32), if one is interested in resemblance 0.5.
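The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation and rests on several assumptions: salted BLAKE2b hashes stand in for random permutations, the sets are taken to be tiny relative to the hash range (the sparse limit), so two different minima agree on their lowest b bits with probability about 2^-b, and the estimator is the resulting simplified form (P_match - 2^-b) / (1 - 2^-b) rather than the paper's general estimator. The function names (`minhash_values`, `bbit_signature`, `estimate_resemblance`, `storage_factor`) are hypothetical. `storage_factor` compares bits-times-variance under the same sparse assumption, treating the full-width hash as original minwise hashing with variance R(1-R)/k.

```python
import hashlib


def minhash_values(items, k):
    """Return k 64-bit minimum hash values, one per salted hash function.

    Salted BLAKE2b is an assumed stand-in for k random permutations.
    """
    mins = []
    for i in range(k):
        salt = i.to_bytes(8, "little")  # distinct salt per hash function
        mins.append(min(
            int.from_bytes(
                hashlib.blake2b(str(x).encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for x in items
        ))
    return mins


def bbit_signature(items, k, b):
    """Keep only the lowest b bits of each of the k minimum hash values."""
    mask = (1 << b) - 1
    return [m & mask for m in minhash_values(items, k)]


def estimate_resemblance(sig1, sig2, b):
    """Simplified estimator, assuming sparse sets (collision constant 2^-b)."""
    p_hat = sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)
    c = 2.0 ** -b  # chance that two different minima share their low b bits
    return (p_hat - c) / (1.0 - c)


def storage_factor(b_full, b, R):
    """Ratio (b_full * Var_full) / (b * Var_b) in the sparse limit."""
    c = 2.0 ** -b
    return (b_full * R * (1.0 - c)) / (b * (c + (1.0 - c) * R))


# Two sets with |A ∩ B| = 100 and |A ∪ B| = 200, so the true resemblance is 0.5.
A = set(range(0, 150))
B = set(range(50, 200))
k, b = 1000, 1
estimate = estimate_resemblance(bbit_signature(A, k, b), bbit_signature(B, k, b), b)
print("estimated resemblance:", estimate)

print(round(storage_factor(64, 1, 0.5), 1))  # 21.3
print(round(storage_factor(32, 1, 0.5), 1))  # 10.7
```

With b=1 each sample costs a single bit but roughly three times as many samples are needed at resemblance 0.5, which is where the factors 64/3 and 32/3 come from under these assumptions.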