Fast near neighbor search in high-dimensional binary data

Authors:
Anshumali Shrivastava;Ping Li
Affiliations:
Cornell University, Ithaca, NY;Cornell University, Ithaca, NY
Venue:
ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I
Year:
2012

Abstract

Numerous applications in search, databases, machine learning, and computer vision can benefit from efficient algorithms for near neighbor search. This paper proposes a simple framework for fast near neighbor search in high-dimensional binary data, which are common in practice (e.g., text). We develop a simple and effective strategy for sub-linear time near neighbor search by building hash tables directly from the bits generated by b-bit minwise hashing. The advantages of our method are demonstrated through thorough comparisons with two strong baselines: spectral hashing and sign (1-bit) random projections.
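
The idea of building hash tables directly from b-bit minwise hash bits can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each binary vector is given as a non-empty set of its nonzero feature indices, it substitutes simple linear hash functions for true random permutations, and the names `BBitMinwiseIndex`, `bucket_key`, and `make_hashes`, as well as the parameters `b`, `k`, and `num_tables`, are hypothetical choices made for illustration.

```python
import random
from collections import defaultdict

# Assumption: all feature indices are smaller than this Mersenne prime (2^31 - 1).
PRIME = 2_147_483_647


def make_hashes(n, rng):
    """Draw n (a, c) pairs for hash functions h(x) = (a*x + c) mod PRIME.

    These stand in for random permutations (an approximation, not the exact scheme).
    """
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]


def bucket_key(features, hashes, b):
    """Concatenate the lowest b bits of each minwise hash into one integer key."""
    mask = (1 << b) - 1
    key = 0
    for a, c in hashes:
        min_hash = min((a * x + c) % PRIME for x in features)
        key = (key << b) | (min_hash & mask)
    return key


class BBitMinwiseIndex:
    """Several hash tables, each keyed by b*k bits taken from k minwise hashes."""

    def __init__(self, b=2, k=4, num_tables=8, seed=0):
        rng = random.Random(seed)
        self.b = b
        self.hashes = [make_hashes(k, rng) for _ in range(num_tables)]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.items = {}

    def add(self, item_id, features):
        """Store the item and place its id in one bucket per table."""
        self.items[item_id] = frozenset(features)
        for hashes, table in zip(self.hashes, self.tables):
            table[bucket_key(features, hashes, self.b)].append(item_id)

    def query(self, features):
        """Collect items sharing a bucket with the query, ranked by resemblance."""
        features = frozenset(features)
        candidates = set()
        for hashes, table in zip(self.hashes, self.tables):
            key = bucket_key(features, hashes, self.b)
            candidates.update(table.get(key, ()))

        def resemblance(item_id):
            stored = self.items[item_id]
            return len(stored & features) / len(stored | features)

        return sorted(candidates, key=resemblance, reverse=True)


index = BBitMinwiseIndex(b=2, k=4, num_tables=8, seed=0)
index.add("doc_a", {1, 5, 9, 42})
index.add("doc_b", {2, 6, 10, 77})
neighbors = index.query({1, 5, 9, 42})
```

In this sketch, a query only inspects the buckets its own keys point to, so the candidate set is typically much smaller than the full collection. Larger `b` or `k` makes buckets more selective, while more tables raises the chance that a true near neighbor shares at least one bucket with the query. The exact re-ranking step at the end is an added convenience, not something the abstract specifies.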