Multi-Lingual Cascading Text Compressors for WWW

Authors:
Chi-Hung Chi
Affiliations:
-
Venue:
ITCC '00 Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'00)
Year:
2000

Citing 5
Cited 0

Modeling for text compression
ACM Computing Surveys (CSUR)

Text compression
Text compression

Extending Huffman coding for multilingual text compression
DCC '95 Proceedings of the Conference on Data Compression

Study on Multi-lingual LZ77 and LZ78 Text Compression
DCC '98 Proceedings of the Conference on Data Compression

A Technique for High-Performance Data Compression
Computer


Abstract

Global sharing and distribution of information on the Internet have created a great demand for efficient multi-lingual text compression in web server and proxy implementations. Current text compressors such as Huffman coding, Lempel-Ziv (LZ) variants, and LZ-Huffman cascades fail to perform efficiently on such text because their character sampling size is mismatched and the character sets of multilingual text are large. Our previous research [7,8] showed that a better compression ratio can be obtained by re-adjusting the character-sampling rate. In this paper, we investigate the cascading of LZ variants with Huffman coding for multilingual documents. Two basic approaches, static and dynamic dictionaries, are proposed. Techniques for reducing the dictionary overhead are also suggested. On our multi-lingual corpus, our adaptive cascading scheme performs better than the well-known cascading compressor, gzip, by an average of about 20%.
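
The sampling-size mismatch described in the abstract can be illustrated with a minimal sketch. It is not the paper's compressor: it only counts the payload bits a plain Huffman code would need when the same byte stream is sampled in 8-bit versus 16-bit units. The function name `huffman_cost_bits`, the UTF-16 sample string, and the choice to ignore code-table (dictionary) overhead are all assumptions made for illustration.

```python
import heapq
from collections import Counter


def huffman_cost_bits(data: bytes, unit: int) -> int:
    """Total payload bits of a Huffman code over `unit`-byte symbols.

    Hypothetical helper for illustration; code-table (dictionary)
    overhead is deliberately not counted.
    """
    if unit <= 0 or len(data) % unit != 0:
        raise ValueError("data length must be a positive multiple of unit")
    # Sample the stream in fixed-size units (1 byte = 8-bit, 2 bytes = 16-bit).
    symbols = [data[i:i + unit] for i in range(0, len(data), unit)]
    weights = list(Counter(symbols).values())
    if not weights:
        return 0
    if len(weights) == 1:
        # A lone symbol still needs one bit per occurrence.
        return weights[0]
    heapq.heapify(weights)
    total = 0
    while len(weights) > 1:
        # Merge the two least frequent subtrees, as in Huffman's algorithm.
        merged = heapq.heappop(weights) + heapq.heappop(weights)
        total += merged  # each merge adds one bit to every symbol beneath it
        heapq.heappush(weights, merged)
    return total


# Assumed sample: two-byte-per-character text, encoded as UTF-16 (big-endian).
sample = "中中文文".encode("utf-16-be")
print(huffman_cost_bits(sample, 1), huffman_cost_bits(sample, 2))
```

Sampling at the character width lets each character map to a single code word, whereas 8-bit sampling splits every character into two unrelated symbols. The larger symbol alphabet also means a larger code table in practice, which is the dictionary overhead the abstract refers to and which this sketch leaves out.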