Speeding up q-gram mining on grammar-based compressed texts

  • Authors:
  • Keisuke Goto;Hideo Bannai;Shunsuke Inenaga;Masayuki Takeda

  • Affiliations:
  • Department of Informatics, Kyushu University, Japan;Department of Informatics, Kyushu University, Japan;Department of Informatics, Kyushu University, Japan;Department of Informatics, Kyushu University, Japan

  • Venue:
  • CPM'12: Proceedings of the 23rd Annual Conference on Combinatorial Pattern Matching
  • Year:
  • 2012

Abstract

We present an efficient algorithm for calculating q-gram frequencies on strings represented in compressed form, namely, as a straight-line program (SLP). Given an SLP $\mathcal{T}$ of size $n$ that represents a string $T$, the algorithm computes the occurrence frequencies of all q-grams in $T$ by reducing the problem to the weighted q-gram frequencies problem on a trie-like structure of size $m = |T|-\mathit{dup}(q,\mathcal{T})$, where $\mathit{dup}(q,\mathcal{T})$ is a quantity that represents the amount of redundancy that the SLP captures with respect to q-grams. The reduced problem can be solved in linear time. Since $m = O(qn)$, the running time of our algorithm is $O(\min\{|T|-\mathit{dup}(q,\mathcal{T}),qn\})$, improving our previous $O(qn)$ algorithm when $q = \Omega(|T|/n)$.
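
To make the problem statement concrete, the following Python sketch counts q-grams directly on an SLP without expanding it. It is not the trie-based reduction described in the abstract; it is a simplified illustration in the spirit of an O(qn)-style approach, in which every q-gram occurrence is charged to the unique rule X_i -> X_l X_r whose boundary it crosses, weighted by how often X_i occurs in the derivation of T. The rule encoding (a list whose entries are either a single character or a pair of earlier indices) and the function name `qgram_frequencies` are assumptions made for this example. The sketch also hashes each q-gram as a substring, so it does not achieve the linear-time bound on the reduced problem.

```python
from collections import Counter


def qgram_frequencies(rules, q):
    """Count the q-grams of the string derived by the last rule of an SLP.

    Assumed (hypothetical) encoding: rules[i] is either a single character
    (X_i -> a) or a pair (l, r) with l, r < i (X_i -> X_l X_r).
    """
    if q < 1:
        raise ValueError("q must be at least 1")
    n = len(rules)
    k = q - 1

    # Bottom-up pass: prefix and suffix of length at most q-1 per variable.
    pre = [""] * n
    suf = [""] * n
    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            pre[i] = rule[:k]
            suf[i] = rule[:k]
        else:
            l, r = rule
            pre[i] = (pre[l] + pre[r])[:k]
            s = suf[l] + suf[r]
            suf[i] = s[max(0, len(s) - k):]

    # Top-down pass: number of times each variable occurs in the
    # derivation tree of the start variable (the last rule).
    vocc = [0] * n
    vocc[n - 1] = 1
    for i in reversed(range(n)):
        if not isinstance(rules[i], str):
            l, r = rules[i]
            vocc[l] += vocc[i]
            vocc[r] += vocc[i]

    # Weighted counting: each q-gram occurrence is charged to exactly one rule.
    counts = Counter()
    for i, rule in enumerate(rules):
        if vocc[i] == 0:
            continue  # variable is never used by the start variable
        if isinstance(rule, str):
            if q == 1:
                counts[rule] += vocc[i]
        else:
            l, r = rule
            # Every length-q window of t crosses the X_l / X_r boundary,
            # because each half has length at most q-1.
            t = suf[l] + pre[r]
            for j in range(len(t) - q + 1):
                counts[t[j:j + q]] += vocc[i]
    return counts


# Fibonacci-style SLP deriving "abaababa".
fib_rules = ["b", "a", (1, 0), (2, 1), (3, 2), (4, 3)]
bigrams = qgram_frequencies(fib_rules, 2)
```

For this example SLP, `bigrams` holds the counts ab: 3, ba: 3, aa: 1, which matches a direct scan of "abaababa". Charging each occurrence to the lowest rule that spans it is what lets the repeated variables be processed once rather than once per occurrence.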