White [1967] proposed compressing text by "replacing [a] repeated string by a reference to [an] earlier occurrence". Ziv and Lempel [1977, 1978] implemented this idea by cleverly representing strings that occur in a relatively small sliding window. We extend the basic idea to represent long common strings that may appear far apart in the input text.

On typical English text, our method provides little compression; there are few long common strings to be exploited. Some files, though, do contain repeated long strings. Baker [1995] documented significant repetition in the code of large software systems. In a mathematical subroutine library, we found many blocks of code repeated across functions of type float, double, complex, and double complex; our method combined with a standard compression algorithm reduced the file to half the size given by the standard algorithm alone. Corpora of real documents, such as correspondence, news articles, or netnews, often contain long duplications due to quoting or republication, or even plagiarism.

We begin by illustrating the opportunity for compressing very long strings. We then survey Karp and Rabin's [1987] algorithm for string matching, and apply it to data compression. Experiments show the efficacy of the new method on some classes of input, and analysis shows that it is efficient in run time.
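To make the idea of applying Karp-Rabin fingerprints to compression more concrete, the following is a minimal sketch, not the method as the paper specifies it. It assumes one plausible arrangement: fingerprints of non-overlapping blocks are kept in a table, and a rolling fingerprint of the current window is looked up to find an earlier occurrence. The function name `find_long_repeats`, the `block` parameter, and the constants `BASE` and `MOD` are all hypothetical choices made for illustration.

```python
BASE = 257               # assumed polynomial base for the fingerprint
MOD = (1 << 61) - 1      # assumed large prime modulus


def find_long_repeats(data: bytes, block: int = 4):
    """Return (position, earlier_position, length) triples for repeats
    of at least `block` bytes, found with a Karp-Rabin rolling hash."""
    n = len(data)
    if n < block:
        return []
    top = pow(BASE, block - 1, MOD)   # weight of the byte leaving the window
    table = {}        # fingerprint -> start of an earlier aligned block
    pending = None    # aligned block not yet fully behind the window
    matches = []
    skip_until = 0    # do not report matches inside an already reported one

    # Fingerprint of the first window data[0:block].
    fp = 0
    for k in range(block):
        fp = (fp * BASE + data[k]) % MOD

    i = 0
    while True:
        if i % block == 0:
            # The previous aligned block now ends at or before i, so it
            # can be referenced without overlapping the current window.
            if pending is not None:
                table.setdefault(pending[0], pending[1])
            pending = (fp, i)

        j = table.get(fp)
        if i >= skip_until and j is not None:
            # Fingerprints can collide, so confirm with a byte comparison.
            if data[j:j + block] == data[i:i + block]:
                length = block
                while i + length < n and data[j + length] == data[i + length]:
                    length += 1
                matches.append((i, j, length))
                skip_until = i + length

        if i + block >= n:
            break
        # Slide the window one byte: drop data[i], append data[i + block].
        fp = ((fp - data[i] * top) * BASE + data[i + block]) % MOD
        i += 1
    return matches
```

For example, `find_long_repeats(b"abcdefgh" + b"XYZ" + b"abcdefgh", block=4)` reports that the 8 bytes at position 11 repeat the 8 bytes at position 0. A compressor built on this would emit a reference in place of the repeated bytes. The sketch only extends matches forward, so a repeat that does not begin on an aligned block may be detected a few bytes late; extending backward as well would recover those bytes.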