Linear-Time, Incremental Hierarchy Inference for Compression

  • Authors:
  • Craig G. Nevill-Manning; Ian H. Witten

  • Venue:
  • DCC '97: Proceedings of the Conference on Data Compression
  • Year:
  • 1997

Abstract

Data compression and learning are, in some sense, two sides of the same coin. If we paraphrase Occam's razor by saying that a small theory is better than a larger theory with the same explanatory power, we can characterize data compression as a preoccupation with small, and learning as a preoccupation with better. Nevill-Manning et al. (see Proc. Data Compression Conference, Los Alamitos, CA, pp. 244-253, 1994) presented an algorithm, since dubbed SEQUITUR, that exhibits both faces of the compression/learning coin. Its performance as a data compression scheme outstrips other dictionary schemes, and the structures that it learns from sequences as diverse as DNA and music are intuitively compelling. We present three new results that characterize SEQUITUR's computational and compression performance. First, we prove that SEQUITUR operates in time linear in n, the length of the input sequence, despite its ability to build a hierarchy as deep as log n. Second, we show that a sequence can be compressed incrementally, improving on the non-incremental algorithm described by Nevill-Manning et al. and making on-line compression feasible. Third, we present an intriguing result that emerged during benchmarking: whereas PPMC outperforms SEQUITUR on most files in the Calgary corpus, SEQUITUR regains the lead when tested on multimegabyte sequences. We draw some tentative conclusions about the underlying reasons for this phenomenon, and about the nature of current compression benchmarking.
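
To make the abstract's description concrete, the sketch below is a minimal Python illustration of the two grammar invariants SEQUITUR maintains while reading its input one symbol at a time: digram uniqueness (no pair of adjacent symbols occurs twice in the grammar) and rule utility (every rule is referenced more than once). The function names (`sequitur`, `find_repeated_digram`) and data layout are illustrative assumptions, not the authors' implementation, and this version re-scans the whole grammar after each symbol, so it does not achieve the paper's linear-time bound; that result depends on constant-time digram bookkeeping (a digram index over linked symbol lists) omitted here.

```python
def find_repeated_digram(rules):
    """Return two non-overlapping occurrences of a repeated digram as
    ((rule_id, index), (rule_id, index)), or None if the digram
    uniqueness invariant already holds."""
    seen = {}
    for rid, body in rules.items():
        for i in range(len(body) - 1):
            digram = (body[i], body[i + 1])
            if digram in seen:
                prev = seen[digram]
                # Ignore overlapping occurrences, e.g. 'aa' twice in 'aaa'.
                if prev[0] != rid or i > prev[1] + 1:
                    return prev, (rid, i)
            else:
                seen[digram] = (rid, i)
    return None


def sequitur(sequence):
    """Build a SEQUITUR-style grammar incrementally; rule 0 is the start
    rule S. Terminals are input symbols (assumed here to be characters),
    nonterminals are integer rule ids."""
    rules = {0: []}
    next_id = 1
    for symbol in sequence:
        rules[0].append(symbol)
        while True:
            hit = find_repeated_digram(rules)
            if hit is None:
                break
            (r1, i1), (r2, i2) = hit
            digram = list(rules[r1][i1:i1 + 2])
            # Reuse an existing rule whose entire body is this digram;
            # otherwise create a new rule for it.
            target = next((rid for rid, body in rules.items()
                           if rid != 0 and body == digram), None)
            if target is None:
                target = next_id
                next_id += 1
                rules[target] = digram[:]
            # Replace the later occurrence first so the earlier index
            # stays valid; never rewrite the target rule's own body.
            for rid, i in sorted([(r1, i1), (r2, i2)], reverse=True):
                if rid == target and rules[rid] == digram:
                    continue
                rules[rid][i:i + 2] = [target]
            # Rule utility: inline any rule now referenced only once.
            for rid in [r for r in rules if r != 0]:
                uses = [(r, i) for r, body in rules.items()
                        for i, s in enumerate(body) if s == rid]
                if len(uses) == 1:
                    r, i = uses[0]
                    rules[r][i:i + 1] = rules[rid]
                    del rules[rid]
    return rules


# A repeated toy input factors into a two-level hierarchy:
# {0: [2, 2, 2], 2: ['a', 'b', 'c']}, i.e. S -> AAA, A -> abc.
print(sequitur("abcabcabc"))
```

Note how each appended symbol can trigger a cascade of digram substitutions and rule inlinings; bounding the amortized cost of those cascades is where the paper's linear-time analysis does its work, and processing one symbol at a time is what makes the incremental, on-line compression described above possible.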