A new PPM variant for Chinese text compression

Authors:
Peiliang Wu; W. J. Teahan
Affiliations:
School of Informatics, University of Wales Bangor, Dean Street, Bangor, Gwynedd LL57 1UT, UK. Email: perry@informatics.bangor.ac.uk, wjt@informatics.bangor.ac.uk
Venue:
Natural Language Engineering
Year:
2008


Abstract

Large-alphabet languages such as Chinese are very different from English and therefore present different problems for text compression. In this article, we first examine the characteristics of Chinese, then introduce a new variant of the Prediction by Partial Match (PPM) model designed specifically for Chinese characters. Unlike traditional PPM coding schemes, which encode an escape probability when a novel character occurs in the context, the new scheme encodes the order first and then the symbol, without having to output an escape probability. This scheme achieves excellent compression rates in comparison with other schemes on a variety of Chinese text files.
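
The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no details of how the order is coded, so the adaptive add-one model over orders, the PPMC-style escape estimate used for the baseline, the maximum order of 2, and the 65536-symbol fallback alphabet are all assumptions made here for illustration. The names `ContextCounts`, `escape_cost`, `order_first_cost`, and `total_bits` are hypothetical. The sketch only estimates ideal code lengths in bits; it does not perform arithmetic coding.

```python
import math
from collections import defaultdict

MAX_ORDER = 2          # assumed maximum context length for this sketch
ALPHABET_SIZE = 65536  # assumed size of the uniform order -1 fallback alphabet


class ContextCounts:
    """Symbol counts for every context of length 0..MAX_ORDER."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def contexts(self, history):
        """Yield (order, context) from the longest available context down to order 0."""
        for order in range(min(MAX_ORDER, len(history)), -1, -1):
            yield order, history[len(history) - order:]

    def update(self, history, symbol):
        for _, ctx in self.contexts(history):
            self.counts[ctx][symbol] += 1


def escape_cost(model, history, symbol):
    """Baseline: escape-based coding with PPMC-style estimates (no exclusions)."""
    bits = 0.0
    for _, ctx in model.contexts(history):
        seen = model.counts.get(ctx)
        if not seen:
            continue  # context never seen, so nothing is coded at this order
        total, distinct = sum(seen.values()), len(seen)
        if symbol in seen:
            return bits - math.log2(seen[symbol] / (total + distinct))
        bits -= math.log2(distinct / (total + distinct))  # pay for an escape
    return bits + math.log2(ALPHABET_SIZE)  # novel symbol: uniform order -1


def order_first_cost(model, order_counts, history, symbol):
    """Assumed order-first scheme: code the order, then the symbol, with no escape."""
    chosen = -1  # -1 means the symbol is novel in every available context
    for order, ctx in model.contexts(history):
        seen = model.counts.get(ctx)
        if seen and symbol in seen:
            chosen = order
            break
    # Adaptive add-one model over the MAX_ORDER + 2 possible orders (-1..MAX_ORDER).
    total_orders = sum(order_counts.values()) + (MAX_ORDER + 2)
    bits = -math.log2((order_counts[chosen] + 1) / total_orders)
    order_counts[chosen] += 1  # the order model adapts as coding proceeds
    if chosen == -1:
        return bits + math.log2(ALPHABET_SIZE)
    seen = model.counts[history[len(history) - chosen:]]
    # No probability mass is reserved for an escape at the chosen order.
    return bits - math.log2(seen[symbol] / sum(seen.values()))


def total_bits(text, scheme):
    """Estimated code length of text under 'escape' or 'order_first'."""
    model = ContextCounts()
    order_counts = defaultdict(int)
    bits = 0.0
    for i, symbol in enumerate(text):
        history = text[max(0, i - MAX_ORDER):i]
        if scheme == "escape":
            bits += escape_cost(model, history, symbol)
        else:
            bits += order_first_cost(model, order_counts, history, symbol)
        model.update(history, symbol)
    return bits


if __name__ == "__main__":
    sample = "中文文本压缩中文文本压缩中文文本"
    print(total_bits(sample, "escape"), total_bits(sample, "order_first"))
```

On a toy string like this the two totals say little about real compression performance; the sketch is only meant to show where the escape cost disappears and where the cost of coding the order appears instead.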