Compression of Biological Sequences by Greedy Off-Line Textual Substitution

Authors:
Alberto Apostolico; Stefano Lonardi
Venue:
DCC '00 Proceedings of the Conference on Data Compression
Year:
2000

Citing 12

Data compression: methods and theory
A new challenge for compression algorithms: genetic sequences. Information Processing and Management: an International Journal - Special issue: data compression
On-line versus off-line computation in dynamic text compression. Information Processing Letters
Experiments in text file compression. Communications of the ACM
Discovery by Minimal Length Encoding: A Case Study in Molecular Evolution. Machine Learning
Compression of Strings with Approximate Repeats. ISMB '98 Proceedings of the 6th International Conference on Intelligent Systems for Molecular Biology
A Guaranteed Compression Scheme for Repetitive DNA Sequences. DCC '96 Proceedings of the Conference on Data Compression
Significantly Lower Entropy Estimates for Natural DNA Sequences. DCC '97 Proceedings of the Conference on Data Compression
Protein Is Incompressible. DCC '99 Proceedings of the Conference on Data Compression
Data Compression Using Long Common Strings. DCC '99 Proceedings of the Conference on Data Compression
Offline Dictionary-Based Compression. DCC '99 Proceedings of the Conference on Data Compression
Some Theory and Practice of Greedy Off-Line Textual Substitution. DCC '98 Proceedings of the Conference on Data Compression

Cited 13

Enhancing Data Migration Performance via Parallel Data Compression. IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
DNA Sequence Compression Using the Burrows-Wheeler Transform. CSB '02 Proceedings of the IEEE Computer Society Conference on Bioinformatics
A Lossless Compression Algorithm for DNA sequences. International Journal of Bioinformatics Research and Applications
PPM with the extended alphabet. Information Sciences: an International Journal
Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
Searching a pattern in compressed DNA sequences. International Journal of Bioinformatics Research and Applications
Iterative Dictionary Construction for Compression of Large DNA Data Sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Non-repetitive DNA sequence compression using memoization. ISBMDA'06 Proceedings of the 7th international conference on Biological and Medical Data Analysis
Random access to grammar-compressed strings. Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms
DNA compression challenge revisited: a dynamic programming approach. CPM'05 Proceedings of the 16th annual conference on Combinatorial Pattern Matching
Searching for smallest grammars on large sequences and application to DNA. Journal of Discrete Algorithms
Revisiting bounded context block-sorting transformations. Software: Practice & Experience
Optimized relative Lempel-Ziv compression of genomes. ACSC '11 Proceedings of the Thirty-Fourth Australasian Computer Science Conference - Volume 113

Abstract

In bio-sequence repositories and other applications, such as the production of a CD-ROM or magnetic disk for massive data dissemination, one can afford the extra cost of performing compression off-line in exchange for some gain in compression. In view of the intractability of optimal off-line macro schemes, various approximate schemes have been considered.

Here we follow one of the simplest possible steepest descent paradigms. It consists of performing repeated stages, in each of which we identify a substring of the current version of the text yielding the maximum compression, and then replace all of its occurrences except one with a pair of pointers to the untouched occurrence. This is somewhat dual to the bottom-up vocabulary buildup scheme considered by Rubin. This simple scheme already poses some interesting algorithmic problems.

In terms of performance, the method outperforms current Lempel-Ziv implementations in most cases. Here we show that, on biological sequences, it beats all other generic compression methods and approaches the performance of methods specifically built around some peculiar regularities of DNA sequences, such as tandem repeats and palindromes, that are neither distinguished nor treated selectively here.

The most interesting performances, however, are obtained in the compression of entire groups of genetic sequences forming families with similar characteristics. Grouping sequences in this way is becoming standard and useful in a growing number of important specialized databases. On such inputs, the approach presented here yields scores that are not only better than those of any other method, but also improve with increasing input size. This is to be attributed to a certain ability to capture distant relationships among the sequences in a family, a feature the merits of which were dramatically exposed in the recent paper [4].
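
The repeated greedy stage described in the abstract can be illustrated with a small brute-force sketch. This is not the authors' implementation: the gain estimate, the constant `POINTER_COST`, the length limits, and the function names `best_repeat`, `greedy_compress`, and `decompress` are all assumptions made for illustration, and a practical version would presumably use a suffix-tree-like index rather than scanning every substring.

```python
from collections import defaultdict

POINTER_COST = 2  # assumed cost of one pointer pair, measured in symbols


def best_repeat(text, free, min_len, max_len):
    """Return (gain, length, starts) for the repeat with the largest assumed gain.

    Only occurrences lying entirely on still-literal ("free") positions count,
    and occurrences are picked left to right so that they never overlap.
    """
    best = (0, 0, [])
    n = len(text)
    for length in range(min_len, min(max_len, n // 2) + 1):
        occurrences = defaultdict(list)
        run = 0  # number of consecutive free positions ending at i
        for i in range(n):
            run = run + 1 if free[i] else 0
            if run >= length:
                start = i - length + 1
                starts = occurrences[text[start:i + 1]]
                if not starts or start >= starts[-1] + length:
                    starts.append(start)
        for starts in occurrences.values():
            # Every occurrence but the first is replaced by one pointer pair.
            gain = (len(starts) - 1) * (length - POINTER_COST)
            if gain > best[0]:
                best = (gain, length, starts)
    return best


def greedy_compress(text, min_len=3, max_len=32):
    """Repeat greedy stages until no substring yields a positive gain."""
    free = [True] * len(text)
    pointers = {}  # original position -> (source position, length)
    while True:
        gain, length, starts = best_repeat(text, free, min_len, max_len)
        if gain <= 0:
            break
        source = starts[0]  # the leftmost occurrence is left untouched
        for start in starts[1:]:
            pointers[start] = (source, length)
            for k in range(start, start + length):
                free[k] = False
    # Emit literals and pointers in original text order.
    tokens, i = [], 0
    while i < len(text):
        if i in pointers:
            tokens.append(pointers[i])
            i += pointers[i][1]
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def decompress(tokens):
    """Rebuild the text; every pointer refers to already decoded positions."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            src, length = tok
            out.extend(out[src:src + length])
        else:
            out.append(tok)
    return "".join(out)
```

Under these assumptions, `greedy_compress("ACGTACGTACGT")` keeps the first `ACGT` as literals and emits the pointer `(0, 4)` twice. Because each pointer always targets the leftmost occurrence, decoding proceeds in a single left-to-right pass.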