A new PPM variant for Chinese text compression

  • Authors:
  • Peiliang Wu; W. J. Teahan

  • Affiliations:
  • School of Informatics, University of Wales Bangor, Dean Street, Bangor, Gwynedd LL57 1UT, UK. Email: perry@informatics.bangor.ac.uk, wjt@informatics.bangor.ac.uk

  • Venue:
  • Natural Language Engineering
  • Year:
  • 2008

Abstract

Large-alphabet languages such as Chinese differ greatly from English and therefore present different problems for text compression. In this article, we first examine the characteristics of Chinese, then introduce a new variant of the Prediction by Partial Match (PPM) model designed specifically for Chinese characters. Unlike traditional PPM coding schemes, which encode an escape probability when a novel character occurs in a context, the new scheme encodes the order first and then the symbol, without having to output an escape probability. This scheme achieves excellent compression rates in comparison with other schemes on a variety of Chinese text files.
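
The contrast between escape-based coding and order-first coding can be illustrated with a small sketch. This is not the authors' implementation: it only lists the events each scheme would hand to an arithmetic coder and does not model probabilities or exclusions. All names (`train`, `coding_order`, `escape_events`, `order_first_events`) and the convention that order -1 stands for a novel symbol are assumptions made for illustration.

```python
from collections import defaultdict


def train(text, max_order):
    """Count which symbols followed each context of length 0..max_order."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, sym in enumerate(text):
        for k in range(max_order + 1):
            if i - k < 0:
                break  # not enough preceding characters for this order
            counts[text[i - k:i]][sym] += 1
    return counts


def coding_order(counts, history, sym, max_order):
    """Highest order whose context has already seen `sym`; -1 if novel."""
    for k in range(min(max_order, len(history)), -1, -1):
        ctx = history[len(history) - k:]
        if counts.get(ctx, {}).get(sym, 0) > 0:
            return k
    return -1  # assumed convention: symbol never seen in any context


def escape_events(counts, history, sym, max_order):
    """Escape-based PPM (simplified): one ESC per known context that fails."""
    start = min(max_order, len(history))
    k = coding_order(counts, history, sym, max_order)
    escapes = sum(
        1 for j in range(start, k, -1)
        if history[len(history) - j:] in counts
    )
    return ["ESC"] * escapes + [("SYM", sym)]


def order_first_events(counts, history, sym, max_order):
    """Order-first idea (simplified): state the order, then the symbol."""
    k = coding_order(counts, history, sym, max_order)
    return [("ORDER", k), ("SYM", sym)]


counts = train("中国人民中国", max_order=2)
print(escape_events(counts, "国人", "中", 2))
print(order_first_events(counts, "国人", "中", 2))
```

In this toy run the escape-based scheme emits two escapes before the symbol, whereas the order-first scheme emits a single order event followed by the symbol. How the order itself is predicted and coded is the substance of the article and is not modelled here.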