The bit vector intersection problem

Authors:
R. M. Karp; O. Waarts; G. Zweig
Venue:
FOCS '95 Proceedings of the 36th Annual Symposium on Foundations of Computer Science
Year:
1995

Citing 0
Cited 7

Approximate nearest neighbors: towards removing the curse of dimensionality
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing

Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
Communications of the ACM - 50th anniversary issue: 1958 - 2008

Connectivity structure of bipartite graphs via the KNC-plot
WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining

The Closest Pair Problem under the Hamming Metric
COCOON '09 Proceedings of the 15th Annual International Conference on Computing and Combinatorics

Bucketing coding and information theory for the statistical high-dimensional nearest-neighbor problem
IEEE Transactions on Information Theory

Optimal hash functions for approximate matches on the n-cube
IEEE Transactions on Information Theory

A comparison of extended fingerprint hashing and locality sensitive hashing for binary audio fingerprints
Proceedings of the 1st ACM International Conference on Multimedia Retrieval

Abstract

This paper introduces the bit vector intersection problem: given a large collection of sparse bit vectors, find all the pairs with at least t ones in common for a given input parameter t. The assumption is that the number of ones common to any two vectors is significantly less than t, except for an unknown set of O(n) pairs. This problem has important applications in DNA physical mapping, clustering, and searching for approximate dictionary matches. We present two randomized algorithms that solve this problem with high probability and in sub-quadratic expected time. One of these algorithms is based on a recursive tree-searching procedure, and the other on hashing. We analyze the tree scheme in terms of branching processes, while our analysis of the hashing scheme is based on Markov chains. Since both algorithms have similar asymptotic performance, we also examine experimentally their relative merits in practical situations. We conclude by showing that a fundamental problem arising in the Human Genome Project is captured by the bit vector intersection problem described above and hence can be solved by our algorithms.
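
To make the problem statement concrete, the following is a minimal sketch of the naive quadratic baseline, not either of the paper's two randomized algorithms. It assumes each sparse bit vector is represented as a Python set holding the positions of its ones; the function name `intersecting_pairs` and the sample data are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def intersecting_pairs(vectors, t):
    """Return all index pairs (i, j), i < j, whose vectors share at least t ones.

    Assumption: each sparse bit vector is a set of the positions of its ones.
    This compares every pair, so it takes quadratically many set intersections.
    """
    result = []
    for (i, a), (j, b) in combinations(enumerate(vectors), 2):
        # len(a & b) is the number of ones the two vectors have in common.
        if len(a & b) >= t:
            result.append((i, j))
    return result


# Hypothetical sample input: four sparse bit vectors.
vectors = [
    {1, 4, 7, 9},
    {1, 4, 7, 20},
    {2, 5, 30},
    {4, 9, 11},
]
pairs = intersecting_pairs(vectors, t=3)
```

With these sample vectors and t = 3, only the first two vectors qualify, since they share the ones at positions 1, 4, and 7. This all-pairs comparison is the quadratic cost that the tree-searching and hashing algorithms described in the abstract are designed to avoid.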