Offline Dictionary-Based Compression

Authors:
N. Jesper Larsson;Alistair Moffat
Affiliations:
-;-
Venue:
DCC '99 Proceedings of the Conference on Data Compression
Year:
1999

Citing 0
Cited 41

Approximating the smallest grammar: Kolmogorov complexity in natural models
STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing

Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-byte Character Texts, and Semi-structured Texts
SPIRE 2002 Proceedings of the 9th International Symposium on String Processing and Information Retrieval

Learning Structure from Sequences, with Applications in a Digital Library
ALT '02 Proceedings of the 13th International Conference on Algorithmic Learning Theory

Multiple Pattern Matching Algorithms on Collage System
CPM '01 Proceedings of the 12th Annual Symposium on Combinatorial Pattern Matching

Collage system: a unifying framework for compressed pattern matching
Theoretical Computer Science - Selected papers in honour of Setsuo Arikawa

Compression of Biological Sequences by Greedy Off-Line Textual Substitution
DCC '00 Proceedings of the Conference on Data Compression

Compressed Pattern Matching for Sequitur
DCC '01 Proceedings of the Data Compression Conference

Optimization of html automatically generated by wysiwyg programs
Proceedings of the 13th international conference on World Wide Web

Music information retrieval research and its context at the University of Waikato
Journal of the American Society for Information Science and Technology - Music information retrieval

String Matching Over Compressed Text on Handheld Devices Using Tagged Sub-Optimal Code (TSC)
Real-Time Systems

Block merging for off-line compression
Journal of the American Society for Information Science and Technology

A Run-Time Efficient Implementation of Compressed Pattern Matching Automata
CIAA '08 Proceedings of the 13th international conference on Implementation and Applications of Automata

Experiences with model inference assisted fuzzing
WOOT'08 Proceedings of the 2nd conference on USENIX Workshop on offensive technologies

Reducing Space Requirements for Disk Resident Suffix Arrays
DASFAA '09 Proceedings of the 14th International Conference on Database Systems for Advanced Applications

PPM with the extended alphabet
Information Sciences: an International Journal

Fast and Compact Web Graph Representations
ACM Transactions on the Web (TWEB)

Improving semistatic compression via phrase-based modeling
Information Processing and Management: an International Journal

Faster subsequence and don't-care pattern matching on compressed texts
CPM'11 Proceedings of the 22nd annual conference on Combinatorial pattern matching

Fast q-gram mining on SLP compressed strings
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval

Reference sequence construction for relative compression of genomes
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval

Iterative Dictionary Construction for Compression of Large DNA Data Sets
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections
Proceedings of the VLDB Endowment

Functional programs as compressed data
PEPM '12 Proceedings of the ACM SIGPLAN 2012 workshop on Partial evaluation and program manipulation

Phrase-Based pattern matching in compressed text
SPIRE'06 Proceedings of the 13th international conference on String Processing and Information Retrieval

Random access to grammar-compressed strings
Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms

An efficient pattern matching algorithm on a subclass of context free grammars
DLT'04 Proceedings of the 8th international conference on Developments in Language Theory

Choosing word occurrences for the smallest grammar problem
LATA'10 Proceedings of the 4th international conference on Language and Automata Theory and Applications

Grammar-based compression in a streaming model
LATA'10 Proceedings of the 4th international conference on Language and Automata Theory and Applications

VISION: cloud-powered sight for all: showing the cloud what you see
Proceedings of the third ACM workshop on Mobile cloud computing and services

Self-Indexed Grammar-Based Compression
Fundamenta Informaticae

Wavelet trees for all
CPM'12 Proceedings of the 23rd Annual conference on Combinatorial Pattern Matching

Speeding up q-gram mining on grammar-based compressed texts
CPM'12 Proceedings of the 23rd Annual conference on Combinatorial Pattern Matching

Compressed text indexes with fast locate
CPM'07 Proceedings of the 18th annual conference on Combinatorial Pattern Matching

Efficient LZ78 factorization of grammar compressed text
SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval

Compressed representation of web and social networks via dense subgraphs
SPIRE'12 Proceedings of the 19th international conference on String Processing and Information Retrieval

Fast q-gram mining on SLP compressed strings
Journal of Discrete Algorithms

XML compression via DAGs
Proceedings of the 16th International Conference on Database Theory

Space-efficient data structures for Top-k completion
Proceedings of the 22nd international conference on World Wide Web

Compressed automata for dictionary matching
CIAA'13 Proceedings of the 18th international conference on Implementation and Application of Automata

XML tree structure compression using RePair
Information Systems

FRESCO: Referential Compression of Highly Similar Sequences
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

Abstract

Dictionary-based modelling is the mechanism used in many practical compression schemes. For example, the members of the two Ziv-Lempel families parse the input message into a sequence of phrases selected from a dictionary, and obtain compression since a reference to the phrase can be more compact than the phrase itself.

In most implementations of dictionary-based compression the encoder operates online, incrementally inferring its dictionary of available phrases from previous parts of the message, and adjusting its dictionary after the transmission of each phrase. Doing so allows the dictionary to be transmitted implicitly, since the decoder simultaneously makes similar adjustments to its dictionary.

An alternative approach (the topic explored in this paper) is to use the full message (or a large block of it) to infer a complete dictionary in advance, and include an explicit representation of the dictionary as part of the compressed message. Intuitively, the advantage of this offline approach is that with the benefit of having access to all of the message, it should be possible to optimize the choice of phrases so as to maximize compression performance. Indeed, we demonstrate that very good compression can be attained by an offline method without compromising the fast decoding that is a distinguishing characteristic of dictionary-based techniques.

Several nontrivial sources of overhead (in terms of both computation resources required to perform the compression, and bits generated into the compressed message) have to be carefully managed as part of the offline process. To meet this challenge, we have developed a novel phrase derivation method and a compact dictionary encoding. In combination these two techniques produce the compression scheme Re-Pair, which is highly efficient, particularly in decompression.

It should also be noted that while offline compression involves the disadvantage of having to store a large part of the message in memory for processing, the difference between doing this and storing the growing dictionary of an online compressor is illusory. Indeed, incremental dictionary-based algorithms maintain an equally large part of the message in memory as part of the dictionary; similarly, online predictive symbol-based context models occupy space that may be linear in the size of that part of the message on which prediction is based.

Our scheme is offline only while inferring the dictionary, and during decompression bits are read and phrases written in a fully interleaved manner. Moreover, during decoding only a compact representation of the dictionary must be stored. Thus, during decompression, our approach has a space advantage over both incremental dictionary-based schemes and over context-based source models.
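
The abstract does not spell out the phrase derivation method, so the following is only a rough illustration of the offline idea it describes: infer an explicit dictionary from the whole message, then decode by expanding phrases as the compressed sequence is read. The sketch assumes the recursive pairing strategy commonly associated with Re-Pair (repeatedly replace the most frequent adjacent pair with a new symbol). It is a naive version that rescans the message on every round, and it omits the paper's compact dictionary encoding and entropy coding. All function names are hypothetical.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts = Counter()
    last_pos = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # In a run such as "aaa", the pair at i overlaps the one counted at i-1.
        if last_pos.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def replace_pair(seq, pair, symbol):
    """Replace every left-to-right occurrence of `pair` with `symbol`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair_compress(message):
    """Offline pass over the whole message.

    Returns (sequence, rules): terminals are one-character strings,
    phrase symbols are ints, and rules maps each int to its pair.
    """
    seq = list(message)
    rules = {}
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:  # no pair repeats, so a new phrase would not pay off
            break
        symbol = len(rules)
        rules[symbol] = pair
        seq = replace_pair(seq, pair, symbol)
    return seq, rules


def repair_decompress(seq, rules):
    """Expand symbols as they are read; only the rules need to be stored."""
    out = []
    for item in seq:
        stack = [item]
        while stack:
            s = stack.pop()
            if isinstance(s, int):
                left, right = rules[s]
                stack.append(right)
                stack.append(left)
            else:
                out.append(s)
    return "".join(out)
```

For example, `repair_decompress(*repair_compress(text))` should return `text` unchanged. Note that the decoder here touches only the rule table and the current symbol, which mirrors the abstract's point that decoding needs just a representation of the dictionary rather than the full message history.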