Data Compression Using Long Common Strings

  • Authors:
  • Jon Bentley;Douglas McIlroy

  • Affiliations:
  • -;-

  • Venue:
  • DCC '99 Proceedings of the Conference on Data Compression
  • Year:
  • 1999

Quantified Score

Hi-index 0.00

Visualization

Abstract

White [1967] proposed compressing text by "replacing [a] repeated string by a reference to [an] earlier occurrence". Ziv and Lempel [1977, 1978] implemented this idea by cleverly representing strings that occur in a relatively small sliding window. We extend the basic idea to represent long common strings that may appear far apart in the input text.On typical English text, our method provides little compression; there are few long common strings to be exploited. Some files, though, do contain repeated long strings. Baker [1995] documented significant repetition in the code of large software systems. In a mathematical subroutine library, we found many blocks of code repeated across functions of type float, double, complex, and double complex; our method combined with a standard compression algorithm reduced the file to half the size given by the standard algorithm alone. Corpora of real documents, such as correspondence, news articles, or netnews often contain long duplications due to quoting or republication, or even plagiarism.We begin by illustrating the opportunity for compressing very long strings. We then survey Karp and Rabin's [1987] algorithm for string matching, and apply it to data compression. Experiments show the efficacy of the new method on some classes of input, and analysis shows that it is efficient in run time.