Efficient approximate dictionary look-up for long words over small alphabets

Authors: Abdullah N. Arslan
Affiliations: Department of Computer Science, University of Vermont, Burlington, VT
Venue: LATIN'06 Proceedings of the 7th Latin American conference on Theoretical Informatics
Year: 2006

Abstract

Given a dictionary ${\mathcal W}$ consisting of n binary strings of length m each, a d-query asks whether there exists a string in ${\mathcal W}$ within Hamming distance d of a given binary query string q. The problem was posed by Minsky and Papert in 1969 as a challenge to data structure design. Answering a d-query involves a tradeoff between time and space. Recently developed time-efficient methods for text indexing with errors can be used to answer a d-query in O(m) time. However, these methods use $O(n\log^{d} n)$ (or more) additional space, which is not practical for large databases. We present a method for the problem assuming the standard RAM model of computation. We process the dictionary to construct an edge-labelled tree in which sibling edges carry distinct labels, and whose branching factor and height are bounded. Storing the resulting tree does not require asymptotically more space than an ordinary trie that stores the given dictionary. We present an algorithm for the d-query problem that takes $O(m(3\log_{4/3} n - 1)^{d}(\log_{2} n)^{d+1})$ time and uses only O(m) additional space. We also generalize the results to larger alphabets and to edit distance. We achieve $O(m(2|\Sigma|-1)^{d}(\log_{(2|\Sigma|-1)} n - 1)^{d}(\log_{2} n)^{d+1})$ time complexity for the problem when Hamming distance is used. The time complexity increases by a factor of $O(d(2|\Sigma|-1)^{d}(\log_{2} n)^{d})$ when edit distance is used. The algorithms are efficient when the approximate dictionary look-up involves long words defined over small alphabets. The algorithm can be modified to allow words of different lengths as well as query strings of different lengths.
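
To make the d-query definition concrete, the following is a minimal Python sketch of two baseline ways to answer it: a linear scan over the dictionary, and a depth-first search over an ordinary trie that spends at most d mismatches along any root-to-leaf path. This is not the bounded-height tree or the algorithm described in the abstract, and it does not achieve the stated time bounds; it only illustrates the problem being solved. All function names (`hamming`, `d_query_scan`, `build_trie`, `d_query_trie`) are hypothetical, and the sketch assumes that all dictionary words have the same length.

```python
from typing import Iterable


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def d_query_scan(words: Iterable[str], q: str, d: int) -> bool:
    """Baseline 1: compare q against every word, O(n * m) time per query."""
    return any(len(w) == len(q) and hamming(w, q) <= d for w in words)


def build_trie(words: Iterable[str]) -> dict:
    """Ordinary nested-dict trie (assumption: all words have equal length)."""
    root: dict = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
    return root


def d_query_trie(root: dict, q: str, d: int) -> bool:
    """Baseline 2: DFS over the trie with a budget of d mismatches."""
    if not root:
        return False  # empty dictionary: nothing can match
    stack = [(root, 0, d)]  # (node, depth, remaining mismatch budget)
    while stack:
        node, depth, budget = stack.pop()
        if depth == len(q):
            if not node:  # leaf node: a whole word was matched
                return True
            continue
        for ch, child in node.items():
            cost = 0 if ch == q[depth] else 1
            if cost <= budget:
                stack.append((child, depth + 1, budget - cost))
    return False


if __name__ == "__main__":
    words = ["0000", "0110", "1111"]
    trie = build_trie(words)
    print(d_query_trie(trie, "0100", 1))  # True: "0000" is at distance 1
    print(d_query_trie(trie, "1001", 1))  # False: the nearest words are at distance 2
    print(d_query_trie(trie, "1001", 2))  # True: "0000" and "1111" are at distance 2
```

The trie search uses no precomputed index beyond the trie itself, but the number of paths it may explore grows quickly with d, which is the kind of cost the bounded branching factor and height in the abstract are meant to control.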