Efficient LZ78 factorization of grammar compressed text

Authors:
Hideo Bannai;Shunsuke Inenaga;Masayuki Takeda
Affiliations:
Department of Informatics, Kyushu University, Japan;Department of Informatics, Kyushu University, Japan;Department of Informatics, Kyushu University, Japan
Venue:
SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval
Year:
2012

Abstract

We present an efficient algorithm for computing the LZ78 factorization of a text, where the text is represented as a straight-line program (SLP), which is a context-free grammar in Chomsky normal form that generates a single string. Given an SLP of size n representing a text S of length N, our algorithm computes the LZ78 factorization of S in $O(n\sqrt{N}+m\log N)$ time and $O(n\sqrt{N}+m)$ space, where m is the number of resulting LZ78 factors. We also show how to improve the algorithm so that the $n\sqrt{N}$ term in the time and space complexities becomes either nL, where L is the length of the longest LZ78 factor, or (N−α), where α≥0 is a quantity that depends on the amount of redundancy that the SLP captures with respect to substrings of S of a certain length. Since $m=O(N/\log_\sigma N)$, where σ is the alphabet size, the latter is asymptotically at least as fast as a linear-time algorithm that runs on the uncompressed string when σ is constant, and can be more efficient when the text is compressible, i.e., when m and n are small.
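To make the objects in the abstract concrete, the following is a minimal Python sketch of the naive baseline, not the algorithm of the paper: it first decompresses an SLP into the full string S and then computes the LZ78 factorization with a trie. Decompression alone costs time and space proportional to N, which is exactly what the paper's algorithm avoids. The names `expand_slp`, `lz78_factorize`, and the rule names `X1` to `X6` are hypothetical and chosen only for illustration.

```python
def expand_slp(rules, start):
    """Naively expand an SLP into the string it generates.

    `rules` maps a variable name either to a single character
    (rule X -> a) or to a pair of variable names (rule X -> Y Z),
    mirroring Chomsky normal form. Takes time and space linear in N.
    """
    memo = {}

    def expand(var):
        if var not in memo:
            rhs = rules[var]
            if isinstance(rhs, str):
                memo[var] = rhs
            else:
                memo[var] = expand(rhs[0]) + expand(rhs[1])
        return memo[var]

    return expand(start)


def lz78_factorize(text):
    """Return the LZ78 factors of `text` as a list of strings.

    Each factor is the longest previous factor (possibly empty)
    extended by one character. The final factor may be an incomplete
    one that repeats an earlier factor.
    """
    trie = {}      # (parent node id, character) -> child node id; 0 is the root
    factors = []
    node = 0       # current position in the trie
    begin = 0      # start index of the factor being built
    for i, ch in enumerate(text):
        child = trie.get((node, ch))
        if child is None:
            # No edge labelled ch: the current factor ends here.
            trie[(node, ch)] = len(factors) + 1
            factors.append(text[begin:i + 1])
            node = 0
            begin = i + 1
        else:
            node = child
    if begin < len(text):
        factors.append(text[begin:])
    return factors


# A small SLP of size n = 6 generating a Fibonacci-like string of length N = 8.
rules = {
    "X1": "a",
    "X2": "b",
    "X3": ("X1", "X2"),
    "X4": ("X3", "X1"),
    "X5": ("X4", "X3"),
    "X6": ("X5", "X4"),
}

if __name__ == "__main__":
    s = expand_slp(rules, "X6")
    print(s, lz78_factorize(s))
```

Here m is `len(lz78_factorize(s))`, n is the number of rules, and N is `len(s)`. The sketch only fixes the input and output of the problem; the bounds stated in the abstract concern computing the same factors without materializing S.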