Speeding up q-gram mining on grammar-based compressed texts

Authors:
Keisuke Goto;Hideo Bannai;Shunsuke Inenaga;Masayuki Takeda
Affiliations:
Department of Informatics, Kyushu University, Japan;Department of Informatics, Kyushu University, Japan;Department of Informatics, Kyushu University, Japan;Department of Informatics, Kyushu University, Japan
Venue:
CPM'12 Proceedings of the 23rd Annual conference on Combinatorial Pattern Matching
Year:
2012

Citing 17
Cited 1

Finding level-ancestors in trees

Journal of Computer and System Sciences
Data compression via textual substitution

Journal of the ACM (JACM)
Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications

CPM '01 Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching
Speeding Up Pattern Matching by Text Compression

CIAC '00 Proceedings of the 4th Italian Conference on Algorithms and Complexity
Offline Dictionary-Based Compression

DCC '99 Proceedings of the Conference on Data Compression
Application of Lempel-Ziv factorization to the approximation of grammar-based compression

Theoretical Computer Science
A Subquadratic Sequence Alignment Algorithm for Unrestricted Scoring Matrices

SIAM Journal on Computing
The level ancestor problem simplified

Theoretical Computer Science - Latin American theoretical informatics
Real-Time Traversal in Grammar-Based Compressed Files

DCC '05 Proceedings of the Data Compression Conference
Linear work suffix array construction

Journal of the ACM (JACM)
A Technique for High-Performance Data Compression

Computer
Compressing and indexing labeled trees, with applications

Journal of the ACM (JACM)
Pattern matching in Lempel-Ziv compressed strings: fast, simple, and deterministic

ESA'11 Proceedings of the 19th European conference on Algorithms
Fast q-gram mining on SLP compressed strings

SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
A universal algorithm for sequential data compression

IEEE Transactions on Information Theory
Compression of individual sequences via variable-rate coding

IEEE Transactions on Information Theory
Self-Indexed Grammar-Based Compression

Fundamenta Informaticae

Efficient LZ78 factorization of grammar compressed text

SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval

Abstract

We present an efficient algorithm for calculating q-gram frequencies on strings represented in compressed form, namely, as a straight-line program (SLP). Given an SLP $\mathcal{T}$ of size n that represents string T, the algorithm computes the occurrence frequencies of all q-grams in T by reducing the problem to the weighted q-gram frequencies problem on a trie-like structure of size $m = |T|-\mathit{dup}(q,\mathcal{T})$, where $\mathit{dup}(q,\mathcal{T})$ is a quantity that represents the amount of redundancy that the SLP captures with respect to q-grams. The reduced problem can be solved in linear time. Since $m = O(qn)$, the running time of our algorithm is $O(\min\{|T|-\mathit{dup}(q,\mathcal{T}),qn\})$, improving our previous $O(qn)$ algorithm when $q = \Omega(|T|/n)$.
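
To make the problem statement concrete, the following is a minimal sketch of counting q-grams directly on an SLP without expanding it. It is not the algorithm of this paper. It illustrates one standard baseline idea that is assumed here: every q-gram occurrence is attributed to the unique rule whose two halves it straddles, and that rule's contribution is weighted by how often the rule occurs in the derivation tree. The rule encoding (`rules` as a list of single characters and index pairs), the function names, and the use of a hash table are all assumptions made for illustration. Because it slices strings and hashes them, this sketch does not achieve the O(qn) bound quoted in the abstract.

```python
from collections import Counter


def expand(rules):
    """Decompress the SLP (used only to check results on small inputs)."""
    val = []
    for rule in rules:
        if isinstance(rule, str):
            val.append(rule)
        else:
            left, right = rule
            val.append(val[left] + val[right])
    return val[-1]


def qgram_freq_slp(rules, q):
    """Count q-grams of the text derived by the last rule.

    Hypothetical encoding: rules[i] is either a single character
    (terminal rule) or a pair (l, r) with l, r < i (X_i -> X_l X_r).
    """
    n = len(rules)
    k = q - 1
    pre = [""] * n   # first min(k, |X_i|) characters of X_i
    suf = [""] * n   # last  min(k, |X_i|) characters of X_i
    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            pre[i] = suf[i] = rule[:k]
        else:
            left, right = rule
            pre[i] = (pre[left] + pre[right])[:k]
            joined = suf[left] + suf[right]
            suf[i] = joined[-k:] if k > 0 else ""

    # vocc[i]: number of nodes labelled X_i in the derivation tree.
    vocc = [0] * n
    vocc[-1] = 1
    for i in range(n - 1, -1, -1):
        if not isinstance(rules[i], str):
            left, right = rules[i]
            vocc[left] += vocc[i]
            vocc[right] += vocc[i]

    counts = Counter()
    for i, rule in enumerate(rules):
        if vocc[i] == 0:
            continue  # rule never used by the start symbol
        if isinstance(rule, str):
            if q == 1:
                counts[rule] += vocc[i]
            continue
        left, right = rule
        # Every q-gram inside this window straddles the X_l / X_r boundary.
        window = suf[left] + pre[right]
        for j in range(len(window) - q + 1):
            counts[window[j:j + q]] += vocc[i]
    return counts


# Fibonacci-style SLP deriving "abaababa".
fib_rules = ["b", "a", (1, 0), (2, 1), (3, 2), (4, 3)]
```

On this example, `qgram_freq_slp(fib_rules, 2)` yields `ab` three times, `ba` three times, and `aa` once, matching a direct scan of `expand(fib_rules)`.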