Data Compression Using Long Common Strings

Authors: Jon Bentley; Douglas McIlroy
Venue: DCC '99 Proceedings of the Conference on Data Compression
Year: 1999


Abstract

White [1967] proposed compressing text by "replacing [a] repeated string by a reference to [an] earlier occurrence". Ziv and Lempel [1977, 1978] implemented this idea by cleverly representing strings that occur in a relatively small sliding window. We extend the basic idea to represent long common strings that may appear far apart in the input text.

On typical English text, our method provides little compression; there are few long common strings to be exploited. Some files, though, do contain repeated long strings. Baker [1995] documented significant repetition in the code of large software systems. In a mathematical subroutine library, we found many blocks of code repeated across functions of type float, double, complex, and double complex; our method combined with a standard compression algorithm reduced the file to half the size given by the standard algorithm alone. Corpora of real documents, such as correspondence, news articles, or netnews, often contain long duplications due to quoting or republication, or even plagiarism.

We begin by illustrating the opportunity for compressing very long strings. We then survey Karp and Rabin's [1987] algorithm for string matching, and apply it to data compression. Experiments show the efficacy of the new method on some classes of input, and analysis shows that it is efficient in run time.
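
The abstract does not spell out how Karp-Rabin fingerprints are applied, so the following is only a minimal sketch of one way the idea might look, not the authors' implementation. It assumes a fixed block size, stores the fingerprint of every aligned block in a table, slides a rolling fingerprint over the input, and on a hit verifies the bytes and extends the match forward. The function name `find_long_repeats`, the `block` parameter, and the constants `BASE` and `MOD` are all hypothetical choices made for illustration.

```python
def find_long_repeats(data: bytes, block: int = 8):
    """Return (position, earlier_position, length) triples for long repeats.

    Illustrative sketch only: block size, hash constants, and the
    forward-only extension are assumptions, not taken from the paper.
    """
    BASE = 257
    MOD = (1 << 61) - 1          # large prime modulus for the fingerprint
    n = len(data)
    if n < block:
        return []

    top = pow(BASE, block - 1, MOD)   # weight of the byte leaving the window
    table = {}                        # fingerprint -> start of an earlier aligned block
    matches = []

    # Fingerprint of the first window data[0:block].
    fp = 0
    for i in range(block):
        fp = (fp * BASE + data[i]) % MOD

    start = 0        # current window is data[start:start + block]
    skip_until = 0   # positions already covered by a reported match
    while True:
        if start >= skip_until:
            cand = table.get(fp)
            # Verify byte-for-byte, since distinct blocks can share a fingerprint.
            if cand is not None and data[cand:cand + block] == data[start:start + block]:
                length = block
                while start + length < n and data[cand + length] == data[start + length]:
                    length += 1
                matches.append((start, cand, length))
                skip_until = start + length

        # Only aligned blocks are remembered, which keeps the table small.
        if start % block == 0:
            table.setdefault(fp, start)

        if start + block >= n:
            break
        # Roll the window one byte to the right.
        fp = ((fp - data[start] * top) * BASE + data[start + block]) % MOD
        start += 1

    return matches


if __name__ == "__main__":
    seg = b"the quick brown fox jumps. "
    sample = seg + b"XYZ" + seg
    for pos, earlier, length in find_long_repeats(sample, block=8):
        print(f"bytes at {pos} repeat {length} bytes from {earlier}")
```

Because only aligned blocks are stored, this sketch can miss repeats shorter than roughly twice the block size and may detect a repeat a few bytes after it actually begins. A fuller version would also extend matches backward and then pass the result to a standard compressor, as the abstract describes.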