A Corpus for the Evaluation of Lossless Compression Algorithms

  • Authors:
  • Ross Arnold; Tim Bell

  • Venue:
  • DCC '97 Proceedings of the Conference on Data Compression
  • Year:
  • 1997

Abstract

A number of authors have used the Calgary corpus of texts to provide empirical results for lossless compression algorithms. This corpus was collected in 1987, although it was not published until 1990. Recent advances in compression algorithms have achieved only relatively small improvements in compression, as measured using the Calgary corpus. There is a concern that algorithms are being fine-tuned to this corpus, and that small improvements measured in this way may not apply to other files. Furthermore, the corpus is almost ten years old, and over this period the kinds of files that are compressed have changed, particularly with the development of the Internet and the rapid growth of high-capacity secondary storage for personal computers. We explore the issues raised above and develop a principled technique for collecting a corpus of test data for compression methods. A corpus, called the Canterbury corpus, is developed using this technique, and we report the performance of a collection of compression methods on the new corpus.