Compression of Annotated Nucleotide Sequences

Authors:
Gergely Korodi;Ioan Tabus
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2007

Abstract

This article introduces an algorithm for the lossless compression of DNA files that contain annotation text in addition to the nucleotide sequence. First, a grammar is designed specifically to capture the regularities of the annotation text. A reversible transformation then uses the grammar rules to represent the original file equivalently as a collection of parsed segments and a sequence of decisions made by the grammar parser. This decomposition enables the efficient use of state-of-the-art encoders for processing the parsed segments. The size of the encoded decision sequence is reduced by extending the parser states to account for high-order Markovian dependencies. The practical implementation of the algorithm achieves a significant improvement over the general-purpose methods currently used for DNA files.
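
The decomposition into parsed segments plus parser decisions, and the effect of conditioning decisions on earlier ones, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy line-based grammar in which every line is a tag followed by a payload; the tag set `ID`, `DE`, `SQ` and the functions `decompose`, `reconstruct` and `adaptive_cost_bits` are hypothetical names introduced here for illustration only.

```python
import math
from collections import defaultdict

# Hypothetical toy grammar (an assumption, not the grammar from the paper):
# every line is "<TAG> <payload>" with TAG drawn from a fixed set.
TAGS = ("ID", "DE", "SQ")


def decompose(text):
    """Split text into parser decisions and one payload stream per tag."""
    decisions = []
    streams = {tag: [] for tag in TAGS}
    for line in text.split("\n"):
        tag, sep, payload = line.partition(" ")
        if sep == "" or tag not in streams:
            raise ValueError(f"line does not match the toy grammar: {line!r}")
        decisions.append(TAGS.index(tag))  # which alternative the parser chose
        streams[tag].append(payload)       # segment routed to its own stream
    return decisions, streams


def reconstruct(decisions, streams):
    """Invert decompose() by replaying the decisions over the streams."""
    cursors = {tag: iter(streams[tag]) for tag in TAGS}
    lines = []
    for d in decisions:
        tag = TAGS[d]
        lines.append(f"{tag} {next(cursors[tag])}")
    return "\n".join(lines)


def adaptive_cost_bits(decisions, order):
    """Ideal code length of the decisions under an adaptive order-k model.

    Each probability is a Laplace-smoothed count conditioned on the previous
    `order` decisions; an arithmetic coder could approach this length.
    """
    counts = defaultdict(lambda: [1] * len(TAGS))
    bits = 0.0
    for i, d in enumerate(decisions):
        context = tuple(decisions[max(0, i - order):i])
        c = counts[context]
        bits += -math.log2(c[d] / sum(c))
        c[d] += 1  # update the model after coding, as a decoder would
    return bits
```

In this sketch each payload stream could be passed to a separate specialised encoder, for example a dedicated nucleotide coder for the `SQ` stream, while the decision sequence is coded on its own. When the decisions follow a regular pattern, conditioning on previous decisions (`order` of 1 or more) gives a shorter ideal code length than the memoryless model (`order` of 0).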