Faster approximate pattern matching in compressed repetitive texts

Authors:
Travis Gagie; Paweł Gawrychowski
Affiliations:
Department of Computer Science, Aalto University, Espoo, Finland; Department of Computer Science, University of Wrocław, Wrocław, Poland
Venue:
ISAAC'11 Proceedings of the 22nd international conference on Algorithms and Computation
Year:
2011

References:

Fast parallel and serial approximate string matching. Journal of Algorithms.
Data compression via textual substitution. Journal of the ACM (JACM).
An analysis of the Burrows-Wheeler transform. Journal of the ACM (JACM).
Approximate String Matching: A Simpler Faster Algorithm. SIAM Journal on Computing.
Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoretical Computer Science.
Note: A simple storage scheme for strings achieving entropy bounds. Theoretical Computer Science.
Compressed Text Indexes with Fast Locate. CPM '07 Proceedings of the 18th annual symposium on Combinatorial Pattern Matching.
Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections. SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval.
On compressing the textual web. Proceedings of the third ACM international conference on Web search and data mining.
LZ77-Like Compression with Fast Random Access. DCC '10 Proceedings of the 2010 Data Compression Conference.
Self-indexing based on LZ77. CPM'11 Proceedings of the 22nd annual conference on Combinatorial pattern matching.
Random access to grammar-compressed strings. Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms.
Stronger Lempel-Ziv Based Compressed Text Indexing. Algorithmica.
Grammar-based compression in a streaming model. LATA'10 Proceedings of the 4th international conference on Language and Automata Theory and Applications.

Abstract

Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching to the underlying string(s). Bille et al. (2011) recently showed how, given a straight-line program with r rules for a string s of length n, we can build an $\mathcal{O}(r)$-word data structure that allows us to extract any substring s[i..j] in $\mathcal{O}(\log n + j - i)$ time. They also showed how, given a pattern p of length m and an edit distance k ≤ m, their data structure supports finding all occ approximate matches to p in s in $\mathcal{O}(r (\min(m k, k^4 + m) + \log n) + \mathsf{occ})$ time. Rytter (2003) and Charikar et al. (2005) showed that r is always at least the number z of phrases in the LZ77 parse of s, and gave algorithms for building straight-line programs with $\mathcal{O}(z \log n)$ rules. In this paper we give a simple $\mathcal{O}(z \log n)$-word data structure that takes the same time for substring extraction but only $\mathcal{O}(z \min(m k, k^4 + m) + \mathsf{occ})$ time for approximate pattern matching.
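
To make the notion of substring extraction from a straight-line program concrete, here is a minimal Python sketch. It is not the data structure of Bille et al. nor the one given in this paper: it simply stores the expansion length of every rule and walks down the grammar, so extracting s[i..j] takes time proportional to the height of the grammar plus the extracted length, rather than the $\mathcal{O}(\log n + j - i)$ bound quoted above. The class name `SLP`, the rule encoding, and the Fibonacci-string example are all assumptions made for illustration.

```python
from typing import List, Tuple, Union

# Hypothetical encoding: a rule is a single terminal character,
# or a pair (left, right) of ids of rules that appear earlier in the list.
Rule = Union[str, Tuple[int, int]]


class SLP:
    """Toy straight-line program with per-rule expansion lengths."""

    def __init__(self, rules: List[Rule]):
        self.rules = rules
        self.length: List[int] = []
        for rule in rules:
            if isinstance(rule, str):
                self.length.append(1)
            else:
                left, right = rule
                # Children must precede the rule, so their lengths are known.
                self.length.append(self.length[left] + self.length[right])

    def extract(self, root: int, i: int, j: int) -> str:
        """Return s[i..j] (0-based, inclusive), where s is the expansion of `root`."""
        if not 0 <= i <= j < self.length[root]:
            raise IndexError("substring bounds out of range")
        out: List[str] = []
        # Each stack entry is (rule id, lo, hi) with lo..hi relative to that rule.
        stack = [(root, i, j)]
        while stack:
            rule_id, lo, hi = stack.pop()
            rule = self.rules[rule_id]
            if isinstance(rule, str):
                out.append(rule)
                continue
            left, right = rule
            left_len = self.length[left]
            # Push the right part first so the left part is emitted first.
            if hi >= left_len:
                stack.append((right, max(lo - left_len, 0), hi - left_len))
            if lo < left_len:
                stack.append((left, lo, min(hi, left_len - 1)))
        return "".join(out)


# Example grammar (an assumption): 6 rules generating the Fibonacci string "abaababa".
fib = SLP(["b", "a", (1, 0), (2, 1), (3, 2), (4, 3)])
whole = fib.extract(5, 0, fib.length[5] - 1)
middle = fib.extract(5, 2, 5)
```

Only the two root-to-leaf paths bounding the requested range are partially visited; every other visited rule lies entirely inside the range, which is why the cost depends on the grammar height and the output length rather than on n.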