Faster approximate pattern matching in compressed repetitive texts

  • Authors:
  • Travis Gagie;Paweł Gawrychowski

  • Affiliations:
  • Department of Computer Science, Aalto University, Espoo, Finland;Department of Computer Science, University of Wrocław, Wrocław, Poland

  • Venue:
  • ISAAC'11 Proceedings of the 22nd international conference on Algorithms and Computation
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

Motivated by the imminent growth of massive, highly redundant genomic databases we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching to the underlying string(s). Bille et al. (2011) recently showed how, given a straight-line program with r rules for a string s of length n, we can build an $\ensuremath{\mathcal{O}\!\left( {r} \right)}$-word data structure that allows us to extract any substring s [i..j] in $\ensuremath{\mathcal{O}\!\left( {\log n + j - i} \right)}$ time. They also showed how, given a pattern p of length m and an edit distance k≤m, their data structure supports finding all occ approximate matches to p in s in $\ensuremath{\mathcal{O}\!\left( {r (\min (m k, k^4 + m) + \log n) + \ensuremath{\mathsf{occ}}} \right)}$ time. Rytter (2003) and Charikar et al. (2005) showed that r is always at least the number z of phrases in the LZ77 parse of s, and gave algorithms for building straight-line programs with $\ensuremath{\mathcal{O}\!\left( {z \log n} \right)}$ rules. In this paper we give a simple $\ensuremath{\mathcal{O}\!\left( {z \log n} \right)}$-word data structure that takes the same time for substring extraction but only $\ensuremath{\mathcal{O}\!\left( {z (\min (m k, k^4 + m)) + \ensuremath{\mathsf{occ}}} \right)}$ time for approximate pattern matching.