This article introduces an algorithm for the lossless compression of DNA files that contain annotation text in addition to the nucleotide sequence. First, a grammar is designed specifically to capture the regularities of the annotation text. A reversible transformation then uses the grammar rules to represent the original file equivalently as a collection of parsed segments and a sequence of decisions made by the grammar parser. This decomposition enables the efficient use of state-of-the-art encoders for the parsed segments. The coded size of the parser's decision sequence is reduced by extending the parser states to account for high-order Markovian dependencies. The practical implementation of the algorithm achieves a significant improvement over the general-purpose methods currently used for DNA files.
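The decomposition described above can be illustrated with a minimal sketch. It is not the article's grammar or encoder: the rule that a line starting with ">" is annotation, the function names, and the Laplace-smoothed adaptive estimate (standing in for a real entropy coder) are all assumptions made for illustration. The sketch splits a file into an annotation stream, a nucleotide stream, and a decision sequence, shows that the split can be inverted, and estimates how many bits the decision sequence would cost under a context model of a given order.

```python
import math
from collections import defaultdict


def split_streams(text):
    """Toy reversible transform: route each line to a stream and log the choice."""
    decisions, annotation, nucleotides = [], [], []
    for line in text.split("\n"):
        if line.startswith(">"):   # hypothetical rule: annotation line
            decisions.append("A")
            annotation.append(line)
        else:                      # hypothetical rule: sequence line
            decisions.append("S")
            nucleotides.append(line)
    return decisions, annotation, nucleotides


def merge_streams(decisions, annotation, nucleotides):
    """Inverse transform: replay the decisions to interleave the two streams."""
    ann, seq = iter(annotation), iter(nucleotides)
    return "\n".join(next(ann) if d == "A" else next(seq) for d in decisions)


def code_length_bits(symbols, order, alphabet=("A", "S")):
    """Ideal adaptive code length (in bits) of `symbols` when each symbol is
    predicted from the previous `order` symbols with Laplace smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    bits = 0.0
    for i, s in enumerate(symbols):
        ctx = tuple(symbols[max(0, i - order):i])
        total = sum(counts[ctx].values())
        p = (counts[ctx][s] + 1) / (total + len(alphabet))
        bits -= math.log2(p)
        counts[ctx][s] += 1        # update the model after coding the symbol
    return bits


sample = ">seq1 Homo sapiens\nACGT\nGGCA\n>seq2 Mus musculus\nTTGA\nCCAT"
decisions, annotation, nucleotides = split_streams(sample)
restored = merge_streams(decisions, annotation, nucleotides)
```

In this toy setting, each stream could be passed to a coder suited to it, and a regular decision sequence costs fewer bits as the context order grows, which is the intuition behind extending the states to capture high-order Markovian dependencies.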