This paper establishes the theoretical framework of b-bit minwise hashing. The original minwise hashing method has become a standard technique for estimating set similarity (e.g., resemblance), with applications in information retrieval, data management, and computational advertising. By storing only the lowest b bits of each hashed value (e.g., b=1 or 2), we gain substantial savings in storage space. We prove the basic theoretical results and provide an unbiased estimator of the resemblance for any b. We demonstrate that, even in the least favorable scenario, using b=1 may reduce the storage space by a factor of at least 21.3 (or 10.7) compared to b=64 (or b=32), if one is interested in resemblance ≥ 0.5.
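The idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes the sparse-set limit (both sets are tiny relative to the universe), in which two b-bit values collide with probability 2^-b + (1 - 2^-b)R, so the resemblance R can be recovered by inverting that relation. The salted BLAKE2b hash stands in for a random permutation, and all function names (`hash64`, `b_bit_signature`, `estimate_resemblance`, `storage_improvement`) are hypothetical.

```python
import hashlib


def hash64(item: int, seed: int) -> int:
    """64-bit hash of a non-negative integer; the salt selects one of k hash functions.
    Assumption: a salted BLAKE2b digest is used as a stand-in for a random permutation."""
    digest = hashlib.blake2b(
        item.to_bytes(8, "little"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def b_bit_signature(items, k: int, b: int) -> list:
    """For each of k hash functions, keep only the lowest b bits of the minimum hash."""
    mask = (1 << b) - 1
    return [min(hash64(x, seed) for x in items) & mask for seed in range(k)]


def estimate_resemblance(sig1, sig2, b: int) -> float:
    """Invert P_b = c + (1 - c) * R with c = 2^-b (sparse-set assumption)."""
    p_hat = sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)
    c = 1.0 / (1 << b)
    return (p_hat - c) / (1.0 - c)


def storage_improvement(r: float, b: int, full_bits: int = 64) -> float:
    """Ratio of storage needed for equal estimator variance: full hashes vs. b-bit hashes.
    Uses per-sample variances R(1-R) (full) and P_b(1-P_b)/(1-c)^2 (b-bit),
    again under the sparse-set assumption."""
    c = 1.0 / (1 << b)
    p = c + (1.0 - c) * r
    var_full = r * (1.0 - r)
    var_b = p * (1.0 - p) / (1.0 - c) ** 2
    return (full_bits * var_full) / (b * var_b)


if __name__ == "__main__":
    set_a = range(0, 300)      # |A ∩ B| = 200, |A ∪ B| = 400, so R = 0.5
    set_b = range(100, 400)
    sig_a = b_bit_signature(set_a, k=400, b=1)
    sig_b = b_bit_signature(set_b, k=400, b=1)
    print("estimated resemblance:", estimate_resemblance(sig_a, sig_b, b=1))
    print("storage factor vs 64 bits:", storage_improvement(0.5, b=1, full_bits=64))
    print("storage factor vs 32 bits:", storage_improvement(0.5, b=1, full_bits=32))
```

Under these assumptions, a 1-bit sample at R = 0.5 has three times the variance of a full hash, so about three times as many samples are needed. That gives 64/3 ≈ 21.3 and 32/3 ≈ 10.7, consistent with the factors quoted in the abstract.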