Efficient approximate dictionary look-up for long words over small alphabets

  • Authors: Abdullah N. Arslan
  • Affiliations: Department of Computer Science, University of Vermont, Burlington, VT
  • Venue: LATIN'06 Proceedings of the 7th Latin American conference on Theoretical Informatics
  • Year: 2006

Abstract

Given a dictionary ${\mathcal W}$ consisting of n binary strings of length m each, a d-query asks whether there exists a string in ${\mathcal W}$ within Hamming distance d of a given binary query string q. The problem was posed by Minsky and Papert in 1969 as a challenge to data structure design. Answering a d-query involves a tradeoff between time and space. Recently developed time-efficient methods for text indexing with errors can be used to answer a d-query in $O(m)$ time. However, these methods use $O(n \log^{d} n)$ (or more) additional space, which is not practical for large databases. We present a method for the problem assuming the standard RAM model of computation. We process the dictionary to construct an edge-labelled tree in which siblings have distinct labels, and whose branching factor and height are bounded. Storing the resulting tree does not require asymptotically more space than an ordinary trie that stores the given dictionary. We present an algorithm for the d-query problem that takes $O(m(3\log_{4/3} n - 1)^{d}(\log_{2} n)^{d+1})$ time and uses only $O(m)$ additional space. We also generalize the results to the cases where a larger alphabet $\Sigma$ or edit distance is used. We achieve $O(m(2|\Sigma|-1)^{d}(\log_{(2|\Sigma|-1)} n - 1)^{d}(\log_{2} n)^{d+1})$ time complexity for the problem when Hamming distance is used. The time complexity increases by a factor of $O(d(2|\Sigma|-1)^{d}(\log_{2} n)^{d})$ when edit distance is used. The algorithms are efficient when the approximate dictionary look-up involves long words defined over small alphabets. The algorithm can be modified to allow dictionary words of different lengths as well as query strings of different lengths.
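
To make the d-query definition concrete, the following minimal sketch shows two baseline ways to answer it: a brute-force scan, and a search over an ordinary trie that spends one unit of an error budget per mismatch. This is not the paper's bounded-height tree or its algorithm; the function names (`hamming`, `d_query_naive`, `build_trie`, `d_query_trie`) and the nested-dict trie are assumptions made purely for illustration, and all words and queries are assumed to have the same length m.

```python
from typing import Iterable


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def d_query_naive(words: Iterable[str], q: str, d: int) -> bool:
    """Brute-force baseline: compare q against every dictionary word."""
    return any(hamming(w, q) <= d for w in words)


def build_trie(words: Iterable[str]) -> dict:
    """Ordinary trie as nested dicts (illustrative, not the paper's tree)."""
    root: dict = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
    return root


def d_query_trie(root: dict, q: str, d: int) -> bool:
    """Depth-first search over the trie with a mismatch budget of d.

    Iterative on purpose: the words may be long, so recursion depth
    proportional to m is avoided. Assumes every word has length len(q).
    """
    stack = [(root, 0, d)]  # (trie node, position in q, remaining budget)
    while stack:
        node, i, budget = stack.pop()
        if i == len(q):
            return True  # reached the end of a word within the budget
        for ch, child in node.items():
            cost = 0 if ch == q[i] else 1
            if cost <= budget:
                stack.append((child, i + 1, budget - cost))
    return False


words = ["0000", "0110", "1111"]
trie = build_trie(words)
print(d_query_naive(words, "0111", 1))  # True: "0110" differs in one position
print(d_query_trie(trie, "0111", 1))    # True
print(d_query_trie(trie, "1001", 1))    # False: every word is at distance >= 2
```

The trie search prunes a branch as soon as its budget is exhausted, but in the worst case it can still explore many paths; the abstract's contribution is a tree with bounded branching factor and height that yields the stated time bounds with only $O(m)$ additional space.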