Bucketing coding and information theory for the statistical high-dimensional nearest-neighbor problem

Authors:
Moshe Dubiner
Affiliations:
Google, Inc., Mountain View, CA
Venue:
IEEE Transactions on Information Theory
Year:
2010

Abstract

The problem of finding high-dimensional approximate nearest neighbors is considered when the data is generated by some known probabilistic model. A large natural class of algorithms (bucketing codes) is investigated. Bucketing information is defined and is proven to bound the performance of all bucketing codes. The bucketing information bound is asymptotically attained by some randomly constructed bucketing codes. The example of $n$ Bernoulli(1/2) very long (length $d \to \infty$) sequences of bits is singled out. It is assumed that $n - 2m$ sequences are completely independent, while the remaining $2m$ sequences are composed of $m$ dependent pairs. The interdependence within each pair is that their bits agree with probability $1/2 < p \le 1$. It is well known how to find most pairs with high probability by performing on the order of $n^{\log_2(2/p)}$ comparisons. It is shown that on the order of $n^{1/p+\epsilon}$ comparisons suffice, for any $\epsilon > 0$. A specific 2-D inequality (proven in another paper) implies that the exponent $1/p$ cannot be lowered. Moreover, if one sequence out of each pair belongs to a known set of $n^{(2p-1)^2}$ sequences, pairing can be done using on the order of $n^{1+\epsilon}$ comparisons!
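To make the data model and the exponents in the abstract concrete, the following is a minimal Python sketch. It is not the paper's bucketing-code construction. It only samples one dependent pair of Bernoulli(1/2) sequences whose bits agree with probability p, and evaluates the three exponents quoted above. All function names are hypothetical and chosen for illustration, and the ε slack in the exponents is omitted.

```python
import math
import random


def classical_exponent(p):
    """Exponent log2(2/p) of the well-known comparison count."""
    return math.log2(2.0 / p)


def bucketing_exponent(p):
    """Exponent 1/p stated in the abstract (epsilon slack omitted)."""
    return 1.0 / p


def known_set_exponent(p):
    """Exponent (2p-1)^2 of the known-set size in the final claim."""
    return (2.0 * p - 1.0) ** 2


def correlated_pair(d, p, rng):
    """Sample one dependent pair of length-d bit sequences.

    x is uniform; each bit of y copies the matching bit of x with
    probability p and flips it otherwise, so y is also Bernoulli(1/2)
    and the two sequences agree bitwise with probability p.
    """
    x = [rng.randrange(2) for _ in range(d)]
    y = [b if rng.random() < p else 1 - b for b in x]
    return x, y


def agreement(x, y):
    """Fraction of positions at which the two sequences agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


if __name__ == "__main__":
    rng = random.Random(0)
    for p in (0.6, 0.75, 0.9, 1.0):
        x, y = correlated_pair(10_000, p, rng)
        print(p, classical_exponent(p), bucketing_exponent(p),
              known_set_exponent(p), agreement(x, y))
```

For any p strictly between 1/2 and 1, the printed value of 1/p is smaller than log2(2/p), which is the improvement the abstract claims; at p = 1 both exponents equal 1.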