Efficient algorithms for substring near neighbor problem

Authors:
Alexandr Andoni;Piotr Indyk
Affiliations:
MIT;MIT
Venue:
SODA '06 Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms
Year:
2006

Citing 19

Suffix arrays: a new method for on-line string searches
SIAM Journal on Computing

Efficient 2-dimensional approximate matching of half-rectangular figures
Information and Computation

Two algorithms for nearest-neighbor search in high dimensions
STOC '97 Proceedings of the twenty-ninth annual ACM symposium on Theory of computing

Approximate nearest neighbors: towards removing the curse of dimensionality
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing

Efficient search for approximate nearest neighbor in high dimensional spaces
STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing

An optimal algorithm for approximate nearest neighbor searching
SODA '94 Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms

Pattern matching for sets of segments
SODA '01 Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms

Finding motifs using random projections
RECOMB '01 Proceedings of the fifth annual international conference on Computational biology

Provably sensitive Indexing strategies for biosequence similarity search
Proceedings of the sixth annual international conference on Computational biology

Similarity Search in High Dimensions via Hashing
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases

Identifying Representative Trends in Massive Time Series Data Sets Using Sketches
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases

The LCA Problem Revisited
LATIN '00 Proceedings of the 4th Latin American Symposium on Theoretical Informatics

Simple and Practical Sequence Nearest Neighbors with Block Operations
CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching

Optimal suffix tree construction with large alphabets
FOCS '97 Proceedings of the 38th Annual Symposium on Foundations of Computer Science

Faster Algorithms for String Matching Problems: Matching the Convolution Bound
FOCS '98 Proceedings of the 39th Annual Symposium on Foundations of Computer Science

Stable distributions, pseudorandom generators, embeddings and data stream computation
FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science

Locality-sensitive hashing scheme based on p-stable distributions
SCG '04 Proceedings of the twentieth annual symposium on Computational geometry

Dictionary matching and indexing with errors and don't cares
STOC '04 Proceedings of the thirty-sixth annual ACM symposium on Theory of computing

Efficient randomized pattern-matching algorithms
IBM Journal of Research and Development - Mathematics and computing

Cited 9

A dictionary for approximate string search and longest prefix search
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management

Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions
Communications of the ACM - 50th anniversary issue: 1958 - 2008

Nearest neighbor search methods for handshape recognition
Proceedings of the 1st international conference on PErvasive Technologies Related to Assistive Environments

Overcoming the l1 non-embeddability barrier: algorithms for product metrics
SODA '09 Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms

Query expansion for hash-based image object retrieval
MM '09 Proceedings of the 17th ACM international conference on Multimedia

HARRA: fast iterative hashed record linkage for large-scale data collections
Proceedings of the 13th International Conference on Extending Database Technology

Fingerprints in compressed strings
WADS'13 Proceedings of the 13th international conference on Algorithms and Data Structures

Streaming similarity search over one billion tweets using parallel locality-sensitive hashing
Proceedings of the VLDB Endowment

Optimal Lower Bounds for Locality-Sensitive Hashing (Except When q is Tiny)
ACM Transactions on Computation Theory (TOCT)

Abstract

In this paper we consider the problem of finding the approximate nearest neighbor when the data set points are the substrings of a given text $T$. Specifically, for a string $T$ of length $n$, we present a data structure which does the following: given a pattern $P$, if there is a substring of $T$ within distance $R$ from $P$, it reports a (possibly different) substring of $T$ within distance $cR$ from $P$. The length of the pattern $P$, denoted by $m$, is not known in advance. For the case where the distances are measured using the Hamming distance, we present a data structure which uses $\tilde{O}(n^{1+1/c})$ space and has $\tilde{O}(n^{1/c} + m n^{o(1)})$ query time. This essentially matches the earlier bounds of [Ind98], which assumed that the pattern length $m$ is fixed in advance. In addition, our data structure can be constructed in time $\tilde{O}(n^{1+1/c} + n^{1+o(1)} M^{1/3})$, where $M$ is an upper bound for $m$. This essentially matches the preprocessing bound of [Ind98] as long as the term $\tilde{O}(n^{1+1/c})$ dominates the running time, which is the case when, e.g., $c < 3$. We also extend our results to the case where the distances are measured according to the $\ell_1$ distance. The query time and the space bound are essentially the same, while the preprocessing time becomes $\tilde{O}(n^{1+1/c} + n^{1+o(1)} M^{2/3})$.
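
To make the query semantics in the abstract concrete, the following is a minimal brute-force sketch for the Hamming case. It is not the data structure from the paper: it scans every length-m substring of T in O(nm) time and uses no preprocessing. The function names `hamming` and `substring_near_neighbor` and the parameter names are hypothetical, chosen only for illustration.

```python
from typing import Optional


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def substring_near_neighbor(text: str, pattern: str, r: int, c: float) -> Optional[int]:
    """Brute-force reference for the (R, c) substring near neighbor query.

    Returns the start index of some length-m substring of `text` whose
    Hamming distance to `pattern` is at most c * r, or None if there is none.
    For c >= 1, whenever some substring lies within distance r, a result is
    always returned, which is the guarantee stated in the abstract.
    """
    m = len(pattern)
    # Linear scan over all candidate substrings (illustration only).
    for start in range(len(text) - m + 1):
        if hamming(text[start:start + m], pattern) <= c * r:
            return start
    return None
```

For example, `substring_near_neighbor("abracadabra", "abrx", r=1, c=2)` returns `0`, because the substring `"abra"` differs from the pattern in one position. The pattern length is taken from the query itself, mirroring the setting in which m is not known in advance.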