Improved Approximate String Matching Using Compressed Suffix Data Structures

  • Authors:
  • Tak-Wah Lam;Wing-Kin Sung;Swee-Seong Wong

  • Affiliations:
  • The University of Hong Kong, Department of Computer Science, Hong Kong, Hong Kong;National University of Singapore, School of Computing, Singapore, Singapore;National University of Singapore, School of Computing, Singapore, Singapore

  • Venue:
  • Algorithmica
  • Year:
  • 2008

Abstract

Approximate string matching is the problem of finding occurrences of a given pattern in a text while allowing a bounded number of errors. In this paper we present a space-efficient data structure for the 1-mismatch and 1-difference problems. Given a text $T$ of length $n$ over an alphabet $A$, we can preprocess $T$ into an $O(n\sqrt{\log n}\log |A|)$-bit data structure so that, for any query pattern $P$ of length $m$, we can find all 1-mismatch (or 1-difference) occurrences of $P$ in $O(|A|m\log\log n+occ)$ time, where $occ$ is the number of occurrences. This is the fastest known query time given that the space of the data structure is $o(n\log^2 n)$ bits. The space of our data structure can be further reduced to $O(n\log |A|)$ bits, with the query time increasing by a factor of $\log^{\epsilon} n$, for $0<\epsilon\le 1$. Furthermore, our solution can be generalized to solve the k-mismatch (and the k-difference) problem in $O(|A|^k m^k (k+\log\log n)+occ)$ and $O(\log^{\epsilon} n\,(|A|^k m^k (k+\log\log n)+occ))$ time using an $O(n\sqrt{\log n}\log |A|)$-bit and an $O(n\log |A|)$-bit indexing data structure, respectively. We assume that the alphabet size $|A|$ is bounded by $O(2^{\sqrt{\log n}})$ for the $O(n\sqrt{\log n}\log |A|)$-bit data structure.
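
To make the 1-mismatch problem concrete, the following is a minimal Python sketch, not the paper's method. It assumes a plain, uncompressed suffix array built naively in place of the compressed suffix structures, and it covers only substitutions (1-mismatch, with exact matches counted as zero mismatches), not insertions or deletions (1-difference). The function names `build_suffix_array`, `sa_range`, and `one_mismatch_occurrences` are hypothetical. The sketch tries every single-character substitution of the pattern, which is where a factor proportional to $|A|m$ in the number of index lookups comes from.

```python
def build_suffix_array(text):
    # Naive construction by sorting suffixes; adequate for a small illustration,
    # and far from the compressed structures discussed in the abstract.
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_range(text, sa, pattern):
    # Binary search for the half-open range [start, end) of suffix-array
    # entries whose suffixes begin with `pattern`.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def one_mismatch_occurrences(text, pattern, alphabet=None):
    # Assumption: the alphabet defaults to the characters present in the text.
    if alphabet is None:
        alphabet = sorted(set(text))
    sa = build_suffix_array(text)
    hits = set()

    # Exact occurrences (zero mismatches).
    start, end = sa_range(text, sa, pattern)
    hits.update(sa[start:end])

    # Every variant of the pattern with exactly one substituted character:
    # m positions times up to |A| - 1 replacement characters.
    for i, original in enumerate(pattern):
        for c in alphabet:
            if c == original:
                continue
            variant = pattern[:i] + c + pattern[i + 1:]
            start, end = sa_range(text, sa, variant)
            hits.update(sa[start:end])

    return sorted(hits)
```

For example, `one_mismatch_occurrences("banana", "bana")` reports the starting positions of `bana` (exact) and `nana` (one substitution). Each lookup here costs a binary search with string comparisons, so this sketch does not achieve the query time stated in the abstract.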