Computing q-gram non-overlapping frequencies on SLP compressed texts

Authors:
Keisuke Goto;Hideo Bannai;Shunsuke Inenaga;Masayuki Takeda
Affiliations:
Department of Informatics, Kyushu University, Nishiku, Fukuoka, Japan;Department of Informatics, Kyushu University, Nishiku, Fukuoka, Japan;Department of Informatics, Kyushu University, Nishiku, Fukuoka, Japan;Department of Informatics, Kyushu University, Nishiku, Fukuoka, Japan
Venue:
SOFSEM'12 Proceedings of the 38th international conference on Current Trends in Theory and Practice of Computer Science
Year:
2012

Abstract

Length-$q$ substrings, or $q$-grams, can represent important characteristics of text data, and determining the frequencies of all $q$-grams contained in the data is an important problem with many applications in the field of data mining and machine learning. In this paper, we consider the problem of calculating the non-overlapping frequencies of all $q$-grams in a text given in compressed form, namely, as a straight line program (SLP). We show that the problem can be solved in $O(q^2 n)$ time and $O(qn)$ space, where $n$ is the size of the SLP. This generalizes and greatly improves previous work (Inenaga & Bannai, 2009), which solved the problem only for $q = 2$ in $O(n^4 \log n)$ time and $O(n^3)$ space.
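
To make the problem statement concrete, the following is a minimal sketch of a naive baseline. It is not the algorithm of the paper: it first decompresses the SLP and then counts on the plain text, so its running time is proportional to the uncompressed length, which can be exponential in $n$. The SLP encoding (a dictionary mapping each variable to either a single character or a pair of variables) and the function names `expand_slp` and `nonoverlapping_qgram_frequencies` are assumptions made for illustration. The non-overlapping frequency of a $q$-gram is taken here to be the maximum number of pairwise non-overlapping occurrences, which a greedy left-to-right scan attains because all occurrences have the same length.

```python
from collections import defaultdict


def expand_slp(rules, start):
    """Decompress an SLP into the string it derives.

    `rules` (hypothetical encoding) maps each variable name to either
    a single character or a pair (left_variable, right_variable).
    """
    memo = {}

    def derive(var):
        if var not in memo:
            rhs = rules[var]
            if isinstance(rhs, str):
                memo[var] = rhs  # terminal rule: X -> a
            else:
                memo[var] = derive(rhs[0]) + derive(rhs[1])  # X -> Y Z
        return memo[var]

    return derive(start)


def nonoverlapping_qgram_frequencies(text, q):
    """Count, for every q-gram, a maximum set of non-overlapping occurrences.

    Greedy rule: scanning left to right, accept an occurrence whenever it
    starts at or after the end of the last accepted occurrence of that q-gram.
    """
    counts = defaultdict(int)
    next_free = {}  # q-gram -> earliest start allowed for its next occurrence
    for i in range(len(text) - q + 1):
        gram = text[i:i + q]
        if next_free.get(gram, 0) <= i:
            counts[gram] += 1
            next_free[gram] = i + q
    return dict(counts)


# Example SLP deriving the Fibonacci-like string "abaababa".
rules = {
    "X1": "a",
    "X2": "b",
    "X3": ("X1", "X2"),  # ab
    "X4": ("X3", "X1"),  # aba
    "X5": ("X4", "X3"),  # abaab
    "X6": ("X5", "X4"),  # abaababa
}
text = expand_slp(rules, "X6")
freqs = nonoverlapping_qgram_frequencies(text, 2)
print(text, freqs)
```

In the string `aaaa`, the 2-gram `aa` occurs three times when overlaps are allowed but only twice without overlaps, which is the distinction the abstract draws. The paper's contribution is to obtain such counts directly on the SLP without the decompression step used in this sketch.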