SAMC - efficient semi-adaptive data compression

  • Authors: Edward Hatton
  • Affiliations: RR#3, Keene, Ontario, K0L 2G0
  • Venue: CASCON '95: Proceedings of the 1995 conference of the Centre for Advanced Studies on Collaborative Research
  • Year: 1995

Abstract

Universal noiseless coding is of considerable interest to industry for the purposes of data reduction in order to store or transmit large volumes of typically textual data. Compression schemes have evolved from simple memoryless Huffman coding, to the Lempel-Ziv family of dictionary compression, to the current Markov or statistical modelling. This evolution has resulted in successively better compression, at an increased cost in execution time and RAM requirements. Bell, Cleary, and Moffat's Markov-based compression scheme PPMC (Prediction by Partial Match, Escape type C) is generally accepted to produce the best compression to date, but is viewed as impractical due to its prodigious memory requirements for both compression and decompression. SAMC (Semi-Adaptive Markov Compression) is proposed as a practical Markov compression scheme. The witnessed improvements are partially due to SAMC's semi-adaptive nature: the data is examined in its entirety and a compression model is built in one pass, and the data is then actually compressed on a second pass. This process avoids the over- and under-adaptation that plague strictly adaptive (one-pass) compression schemes. In this paper, the optimal symbol probability estimate is derived, and a formula is obtained that evaluates the amount of storage required when using this estimate, based on the alphabet size and the counts of the individual symbols. Next, the tree-based data structure used to store the compression model is outlined. Finally, a hill-climbing algorithm is created that minimizes the total amount of storage required for the compressed file. This algorithm is SAMC, and it is compared to several existing data compression schemes in terms of compressed size, and to PPMC in terms of compression/decompression memory requirements and throughput.
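To illustrate the two-pass, semi-adaptive idea described in the abstract, the sketch below counts symbol occurrences over the whole input (pass one) and then evaluates the standard zeroth-order empirical-entropy bound on the encoded size. This is only a minimal, assumed illustration of a semi-adaptive model build: it is not the paper's SAMC algorithm, its order-k Markov model, or its derived storage formula, and the function and file names are hypothetical.

```python
import math
from collections import Counter

def build_model(data: bytes) -> Counter:
    """Pass 1: count symbol occurrences over the entire input."""
    return Counter(data)

def ideal_code_length_bits(counts: Counter) -> float:
    """Zeroth-order empirical-entropy bound on the encoded size, in bits.

    A symbol with count c out of n total is ideally coded in -log2(c / n)
    bits, so the whole file needs sum(-c * log2(c / n)) bits. SAMC's actual
    formula also accounts for the cost of transmitting the model itself.
    """
    n = sum(counts.values())
    return sum(-c * math.log2(c / n) for c in counts.values())

if __name__ == "__main__":
    data = open("sample.txt", "rb").read()   # hypothetical input file
    counts = build_model(data)                # pass 1: build the model
    bits = ideal_code_length_bits(counts)     # lower bound on pass 2's output
    print(f"{len(data)} bytes -> >= {bits / 8:.0f} bytes (order-0 model)")
```

Because the model is fixed before encoding begins, a second pass can code every symbol against the same statistics, which is what lets a semi-adaptive scheme avoid the over- and under-adaptation of strictly one-pass coders.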