Word-Based Statistical Compressors as Natural Language Compression Boosters

  • Authors:
  • Antonio Fariña; Gonzalo Navarro; José R. Paramá

  • Venue:
  • DCC '08 Proceedings of the Data Compression Conference
  • Year:
  • 2008

Abstract

Semistatic word-based byte-oriented compression codes are known to be attractive alternatives for compressing natural language texts. With compression ratios around 30%, they allow direct pattern searching on the compressed text up to 8 times faster than on its uncompressed version. In this paper we show that these compressors have further benefits. Most state-of-the-art compressors, such as the block-wise bzip2, those of the Ziv-Lempel family, and the predictive ppm-based ones, benefit from compressing not the original text but its compressed representation, obtained with a word-based byte-oriented statistical compressor. In particular, our experimental results show that using Dense-Code-based compression as a preprocessing step to classical compressors like bzip2, gzip, or ppmdi yields several important benefits. For example, the ppm family is known for achieving the best compression ratios. With a Dense Code preprocessing step, ppmdi achieves even better compression ratios (the best we know of on natural language) and much faster compression/decompression than ppmdi alone. Text indexing also profits from our preprocessing step. A compressed self-index achieves much better space and time performance when preceded by a semistatic word-based compression step. We show, for example, that the AF-FMindex coupled with Tagged Huffman coding is an attractive alternative index for natural language texts.
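
The pipeline described above can be illustrated with a short sketch: a semistatic word-based byte-oriented first stage (here End-Tagged Dense Codes, one member of the Dense Code family) whose output is then fed to a general-purpose second-stage compressor (here gzip, standing in for bzip2, gzip, or ppmdi). This is a minimal illustration under stated assumptions, not the authors' implementation: the whitespace tokenizer, the function names, and the corpus.txt input file are hypothetical, and storing the vocabulary needed for decompression is omitted.

```python
import gzip
from collections import Counter

def etdc_encode(rank: int) -> bytes:
    """End-Tagged Dense Code for a 0-based frequency rank.
    The last byte has its high bit set (value >= 128); preceding bytes
    are plain 7-bit digits, so codewords are byte-aligned and
    self-delimiting."""
    out = [(rank % 128) + 128]      # tagged last byte
    x = rank // 128
    while x > 0:
        x -= 1
        out.insert(0, x % 128)      # untagged continuation byte
        x //= 128
    return bytes(out)

def boost_compress(text: str) -> tuple[bytes, list[str]]:
    """Word-based byte-oriented semistatic first stage (ETDC-style),
    followed by a general-purpose compressor as the second stage.
    Simplified sketch: tokenization is plain whitespace splitting."""
    words = text.split()
    freq = Counter(words)
    # Semistatic model: more frequent words get lower ranks, hence shorter codes.
    vocab = [w for w, _ in freq.most_common()]
    rank = {w: i for i, w in enumerate(vocab)}
    etdc_stream = b"".join(etdc_encode(rank[w]) for w in words)
    return gzip.compress(etdc_stream), vocab

if __name__ == "__main__":
    # corpus.txt is a placeholder for any natural-language file.
    sample = open("corpus.txt", encoding="utf-8").read()
    boosted, vocab = boost_compress(sample)
    plain = gzip.compress(sample.encode("utf-8"))
    print(f"gzip alone:  {len(plain)} bytes")
    print(f"ETDC + gzip: {len(boosted)} bytes (plus a vocabulary of {len(vocab)} words)")
```

The point of the first stage is that each word becomes a short, byte-aligned codeword, so the second-stage compressor operates on a much shorter and more regular byte sequence; the comparison printed by this sketch is only conceptual and does not reproduce the paper's experiments.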