Quasi-distinct parsing and optimal compression methods

Authors:
Amihood Amir;Yonatan Aumann;Avivit Levy;Yuri Roshko
Affiliations:
Department of Computer Science, Bar Ilan University, Ramat Gan 52900, Israel and Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, United States;Department of Computer Science, Bar Ilan University, Ramat Gan 52900, Israel;Shenkar College, Anna Frank 12, Ramat Gan 52526, Israel and CRI, Haifa University, Mount Carmel, Haifa 31905, Israel;Shenkar College, Anna Frank 12, Ramat Gan 52526, Israel
Venue:
Theoretical Computer Science
Year:
2012




Abstract

In this paper, the optimality proof of Ziv-Lempel coding is re-studied, and a more general compression optimality theorem is derived. In particular, the property of quasi-distinct parsing is defined. This property allows infinitely many repetitions of phrases in the parsing, as long as the total number of repetitions is o(n/log n), where n is the length of the parsed string. The quasi-distinct parsing property is weaker than the distinct parsing property used in the original proof, which does not allow any repetition of phrases in the parsing. Yet we show that the theorem holds with this weaker property as well. This provides a better understanding of the optimality proof of Ziv-Lempel coding, together with a new tool for proving optimality of other compression schemes, one that is applicable to a much wider family of codes. To demonstrate the possible use of this generalization, a new coding method, Arithmetic Progression Tree (APT) coding, is presented. This new coding method is based on a principle that is very different from that of Ziv-Lempel coding. Nevertheless, APT coding is analyzed in this paper and, using the generalized theorem, is shown to be asymptotically optimal up to a constant factor, provided the APT quasi-distinctness hypothesis holds. Empirical evidence that this hypothesis holds is also given.
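
The difference between a distinct parsing and one with repeated phrases can be illustrated with a small sketch. The code below is not taken from the paper: the function names are hypothetical, the incremental parser is the standard LZ78-style rule, and "number of repetitions" is assumed here to mean phrase occurrences beyond the first occurrence of each distinct phrase. Quasi-distinctness is an asymptotic condition, so the ratio computed for a single finite string is only indicative.

```python
import math


def lz78_parse(text):
    """LZ78-style incremental parsing: each phrase is the shortest
    prefix of the remaining text that has not yet appeared as a phrase.
    Only the final phrase may repeat an earlier one."""
    seen = set()
    phrases = []
    current = ""
    for ch in text:
        current += ch
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        # Leftover at the end of the text; it may duplicate a phrase.
        phrases.append(current)
    return phrases


def repetition_count(phrases):
    """Assumed measure: occurrences beyond the first of each phrase."""
    return len(phrases) - len(set(phrases))


def repetition_ratio(phrases):
    """Repetitions divided by n / log2(n), where n is the total length
    of the parsed string (n must be at least 2). A quasi-distinct
    parsing requires this ratio to tend to 0 as n grows."""
    n = sum(len(p) for p in phrases)
    return repetition_count(phrases) / (n / math.log2(n))


if __name__ == "__main__":
    distinct = lz78_parse("aaaaaa")           # ['a', 'aa', 'aaa']
    blocks = ["ab"] * 4                       # fixed-length parsing of "abababab"
    print(distinct, repetition_count(distinct))
    print(blocks, repetition_count(blocks), repetition_ratio(blocks))
```

In this sketch the incremental parsing of "aaaaaa" has no repeated phrase, while the fixed-length parsing of "abababab" repeats its single phrase three times. A parsing of the second kind, whose repetitions grow linearly with n, would not satisfy the o(n/log n) bound.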