Improved approximate string matching using compressed suffix data structures

  • Authors:
  • Tak-Wah Lam;Wing-Kin Sung;Swee-Seong Wong

  • Affiliations:
  • Department of Computer Science, The University of Hong Kong, Hong Kong;School of Computing, National University of Singapore, Singapore;School of Computing, National University of Singapore, Singapore

  • Venue:
  • ISAAC'05 Proceedings of the 16th International Conference on Algorithms and Computation
  • Year:
  • 2005

Abstract

Approximate string matching is the problem of finding occurrences of a given pattern string in a text while allowing a bounded number of errors. In this paper we present a space-efficient data structure to solve the 1-mismatch and 1-difference problems. Given a text T of length n over a fixed alphabet A, we can preprocess T and give an $O(n\sqrt{{\rm log} n})$-bit data structure so that, for any query pattern P of length m, we can find all 1-mismatch (or 1-difference) occurrences of P in $O(m \log\log n + occ)$ time, where occ is the number of occurrences. This is the fastest known query time given that the space of the data structure is $o(n \log^2 n)$ bits. The space of our data structure can be further reduced to $O(n)$ bits if we can afford a slowdown factor of $\log^{\epsilon} n$, for $0 < \epsilon \le 1$. Furthermore, our solution can be generalized to solve the k-mismatch (and the k-difference) problem in $O(|A|^k m^k (k + \log\log n) + occ)$ and $O(\log^{\epsilon} n \, (|A|^k m^k (k + \log\log n) + occ))$ query time using an $O(n\sqrt{{\rm log} n})$-bit and an $O(n)$-bit indexing data structure, respectively.
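
To make the two problem definitions concrete, the following is a minimal brute-force sketch in Python. It is not the indexing data structure of the paper: it scans the text directly, in roughly O(nm) time, and serves only to illustrate what a k-mismatch occurrence (at most k substitutions) and a k-difference occurrence (at most k insertions, deletions, or substitutions) are. The function names are hypothetical, and reporting k-difference matches by their end position in the text is an assumption made for simplicity; the second function uses the standard dynamic-programming recurrence for approximate matching, swapped in here for illustration.

```python
def k_mismatch_occurrences(text, pattern, k=1):
    """Return start positions i where text[i:i+m] differs from pattern
    in at most k positions (Hamming distance <= k). Naive scan."""
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[i:i + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # too many substitutions, abandon this window
        else:
            hits.append(i)  # loop finished without exceeding k
    return hits


def k_difference_end_positions(text, pattern, k=1):
    """Return 1-based end positions j such that some substring of text
    ending at j is within edit distance k of pattern.
    Column-by-column dynamic programming, one column per text character."""
    m = len(pattern)
    col = list(range(m + 1))  # empty text prefix: cost i to match pattern[:i]
    ends = []
    for j, c in enumerate(text, start=1):
        new = [0] * (m + 1)   # row 0 is 0: a match may start anywhere
        for i in range(1, m + 1):
            new[i] = min(
                col[i - 1] + (pattern[i - 1] != c),  # match or substitute
                col[i] + 1,                          # extra text character
                new[i - 1] + 1,                      # missing pattern character
            )
        if new[m] <= k:
            ends.append(j)
        col = new
    return ends
```

For example, `k_mismatch_occurrences("abracadabra", "abda", 1)` returns `[0, 7]`, since both occurrences of "abra" differ from "abda" in exactly one position. The data structure described in the abstract answers the same kind of query without scanning T, in time that depends on m and occ rather than on n.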