Data compression: methods and theory
Data compression: methods and theory
A new challenge for compression algorithms: genetic sequences
Information Processing and Management: an International Journal - Special issue: data compression
On-line versus off-line computation in dynamic text compression
Information Processing Letters
Experiments in text file compression
Communications of the ACM
Compression of Strings with Approximate Repeats
ISMB '98 Proceedings of the 6th International Conference on Intelligent Systems for Molecular Biology
A Guaranteed Compression Scheme for Repetitive DNA Sequences
DCC '96 Proceedings of the Conference on Data Compression
Significantly Lower Entropy Estimates for Natural DNA Sequences
DCC '97 Proceedings of the Conference on Data Compression
DCC '99 Proceedings of the Conference on Data Compression
Data Compression Using Long Common Strings
DCC '99 Proceedings of the Conference on Data Compression
Offline Dictionary-Based Compression
DCC '99 Proceedings of the Conference on Data Compression
Some Theory and Practice of Greedy Off-Line Textual Substitution
DCC '98 Proceedings of the Conference on Data Compression
Enhancing Data Migration Performance via Parallel Data Compression
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
DNA Sequence Compression Using the Burrows-Wheeler Transform
CSB '02 Proceedings of the IEEE Computer Society Conference on Bioinformatics
A Lossless Compression Algorithm for DNA sequences
International Journal of Bioinformatics Research and Applications
PPM with the extended alphabet
Information Sciences: an International Journal
Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval
SPIRE'10 Proceedings of the 17th international conference on String processing and information retrieval
Searching a pattern in compressed DNA sequences
International Journal of Bioinformatics Research and Applications
Iterative Dictionary Construction for Compression of Large DNA Data Sets
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Non-repetitive DNA sequence compression using memoization
ISBMDA'06 Proceedings of the 7th international conference on Biological and Medical Data Analysis
Random access to grammar-compressed strings
Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms
DNA compression challenge revisited: a dynamic programming approach
CPM'05 Proceedings of the 16th annual conference on Combinatorial Pattern Matching
Searching for smallest grammars on large sequences and application to DNA
Journal of Discrete Algorithms
Revisiting bounded context block-sorting transformations
Software—Practice & Experience
Optimized relative Lempel-Ziv compression of genomes
ACSC '11 Proceedings of the Thirty-Fourth Australasian Computer Science Conference - Volume 113
In bio-sequence repositories and other applications, such as the production of a CD-ROM or magnetic disk for massive data dissemination, one can afford the extra cost of performing compression off-line in exchange for some gain in compression. In view of the intractability of optimal off-line macro schemes, various approximate schemes have been considered.

Here we follow one of the simplest possible steepest-descent paradigms. It consists of repeated stages: in each stage we identify a substring of the current version of the text that yields the maximum compression, and then replace all of its occurrences except one with a pair of pointers to the untouched occurrence. This is somewhat dual to the bottom-up vocabulary buildup scheme considered by Rubin. This simple scheme already poses some interesting algorithmic problems.

In terms of performance, the method outperforms current Lempel-Ziv implementations in most cases. Here we show that, on biological sequences, it beats all other generic compression methods and approaches the performance of methods built specifically around some peculiar regularities of DNA sequences, such as tandem repeats and palindromes, that are neither distinguished nor treated selectively here.

The most interesting performances, however, are obtained in the compression of entire groups of genetic sequences forming families with similar characteristics. Grouping sequences in this way is becoming standard and useful practice in a growing number of important specialized databases. On such inputs, the approach presented here yields scores that are not only better than those of any other method, but also improve with increasing input size. This is to be attributed to a certain ability to capture distant relationships among the sequences in a family, a feature whose merits were dramatically exposed in the recent paper [4].
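
The staged greedy substitution described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It is a brute-force toy that assumes a pointer is a (source position, length) pair costing `POINTER_COST` symbols, limits candidate substrings to `max_len` characters, and freezes the untouched occurrence so that later stages cannot alter it. The names `best_substring`, `compress`, `decompress`, and `encoded_cost` are hypothetical, and a practical version would presumably replace the quadratic search with an index structure.

```python
from collections import defaultdict

POINTER_COST = 2  # assumption: one (position, length) pointer costs 2 symbols


def best_substring(text, free, max_len):
    """Return (gain, word, starts) for the substring with the largest saving."""
    occurrences = defaultdict(list)
    n = len(text)
    for i in range(n):
        for length in range(1, max_len + 1):
            end = i + length
            # A candidate may only cover positions no earlier stage has used.
            if end > n or not free[end - 1]:
                break
            if length > POINTER_COST:
                occurrences[text[i:end]].append(i)
    best = (0, "", [])
    for word, starts in occurrences.items():
        # Keep a left-to-right set of non-overlapping occurrences.
        chosen, next_free = [], 0
        for s in starts:
            if s >= next_free:
                chosen.append(s)
                next_free = s + len(word)
        # Every occurrence except the first is replaced by one pointer.
        gain = (len(chosen) - 1) * (len(word) - POINTER_COST)
        if gain > best[0]:
            best = (gain, word, chosen)
    return best


def compress(text, max_len=32):
    """Greedy off-line substitution: one best substring per stage."""
    free = [True] * len(text)
    pointers = {}  # start position -> (source position, length)
    while True:
        gain, word, starts = best_substring(text, free, max_len)
        if gain <= 0:
            break
        source = starts[0]
        for s in starts:
            for k in range(s, s + len(word)):
                free[k] = False  # source is frozen, the others are replaced
        for s in starts[1:]:
            pointers[s] = (source, len(word))
    # Emit literal characters and pointers in text order.
    out, i = [], 0
    while i < len(text):
        if i in pointers:
            out.append(pointers[i])
            i += pointers[i][1]
        else:
            out.append(text[i])
            i += 1
    return out


def decompress(items):
    """Rebuild the text; each pointer copies from already decoded output."""
    out = []
    for item in items:
        if isinstance(item, tuple):
            src, length = item
            out.extend(out[src:src + length])
        else:
            out.append(item)
    return "".join(out)


def encoded_cost(items):
    """Size of the encoding in symbols under the assumed pointer cost."""
    return sum(POINTER_COST if isinstance(it, tuple) else 1 for it in items)
```

For example, `compress("abcabcabc")` keeps the first `abc` as literals and replaces the two later copies with pointers to position 0, and `decompress` restores the original string. Because the kept occurrence is always the leftmost one and is never modified afterwards, a single left-to-right pass suffices for decoding in this sketch.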