Life is based on two polymers, DNA and protein, whose properties can be described in a simple text file. It is natural to expect that standard text compression techniques would work on biological sequences as they do on English text. But biological sequences have a fundamentally different structure from linguistic ones, and standard compression schemes perform disappointingly on them. We describe a new approach to compression that takes account of the underlying biochemical principles. This gives rise to a generalization of blending for statistical compressors in which every context is used, weighted by its similarity to the current context. The results support what research in bioinformatics has shown: there is little Markov dependency in protein. This cripples data compression schemes and reduces them to order-zero models.
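The blending idea can be illustrated with a minimal Python sketch. It is not the method of the paper: the similarity measure (fraction of matching positions between two contexts), the add-one smoothing, and all function names are assumptions chosen for brevity, and a biochemically informed similarity could be substituted for `match_similarity`. The sketch estimates the cost in bits per symbol of an adaptive model that pools next-symbol counts from every previously seen context, and compares it with the empirical order-zero entropy.

```python
import math
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def order0_bits_per_symbol(seq):
    """Empirical order-zero entropy of seq, in bits per symbol."""
    counts = Counter(seq)
    n = len(seq)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def match_similarity(a, b):
    """Hypothetical similarity: fraction of positions where two
    equal-length contexts agree (an assumption, not the paper's measure)."""
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def blended_distribution(context, context_counts, alphabet, similarity):
    """Pool next-symbol counts from every stored context, weighting each
    context by its similarity to the current one."""
    # Add-one smoothing (assumption) keeps every probability nonzero.
    totals = {s: 1.0 for s in alphabet}
    for other, counts in context_counts.items():
        w = similarity(context, other)
        for sym, c in counts.items():
            totals[sym] += w * c
    z = sum(totals.values())
    return {s: v / z for s, v in totals.items()}

def blended_bits_per_symbol(seq, order=2, alphabet=AMINO_ACIDS,
                            similarity=match_similarity):
    """Adaptive coding cost of seq under the blended model, in bits per
    symbol; the first `order` symbols are not charged."""
    context_counts = defaultdict(Counter)
    bits = 0.0
    for i in range(order, len(seq)):
        ctx, sym = seq[i - order:i], seq[i]
        p = blended_distribution(ctx, context_counts, alphabet, similarity)
        bits -= math.log2(p[sym])
        context_counts[ctx][sym] += 1  # update only after coding the symbol
    return bits / (len(seq) - order)
```

Calling `blended_bits_per_symbol(seq)` and `order0_bits_per_symbol(seq)` on the same protein string gives a rough comparison of the two models. If the sequence carries little Markov dependency, the context-based figure should not fall much below the order-zero one.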