Complexity profiles of DNA sequences using finite-context models

Authors:
Armando J. Pinho; Diogo Pratas; Sara P. Garcia
Affiliations:
Signal Processing Lab, IEETA / DETI, University of Aveiro, Aveiro, Portugal (all authors)
Venue:
USAB'11 Proceedings of the 7th Conference on Workgroup Human-Computer Interaction and Usability Engineering of the Austrian Computer Society: Information Quality in e-Health
Year:
2011

Abstract

Every data compression method assumes a certain model of the information source that produces the data. When we improve a data compression method, we are also improving the model of the source. This happens because, when the probability distribution of the assumed source model is closer to the true probability distribution of the source, a smaller relative entropy results and, therefore, fewer redundancy bits are required. This is why the importance of data compression goes beyond the usual goal of reducing the storage space or the transmission time of the information. In fact, in some situations, seeking better models is the main aim. In our view, this is the case for DNA sequence data. In this paper, we give hints on how finite-context (Markov) modeling may be used for DNA sequence analysis, through the construction of complexity profiles of the sequences. These profiles are able to unveil structures of the DNA, some of them with potential biological relevance.
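
To make the idea of a complexity profile concrete, the following is a minimal sketch of an adaptive order-k finite-context model that reports, for each base, the number of bits needed to encode it given the k preceding bases. It is an illustration only and not the authors' implementation: the function name `complexity_profile`, the default order `k=2`, the additive-smoothing parameter `alpha`, and the restriction to the alphabet A, C, G, T are all assumptions made for this example.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: the sequence contains only these four symbols


def complexity_profile(seq, k=2, alpha=1.0):
    """Per-base information content (in bits) under an adaptive order-k model.

    Hypothetical helper for illustration. Counts are updated only after a
    symbol has been scored, so each estimate depends on the past alone.
    """
    # One table of symbol counts per context (the k preceding bases).
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    profile = []
    for i, symbol in enumerate(seq):
        # Contexts are shorter than k at the very start of the sequence.
        context = seq[max(0, i - k):i]
        ctx_counts = counts[context]
        total = sum(ctx_counts.values())
        # Additive-smoothing probability estimate (alpha is an assumed parameter).
        p = (ctx_counts[symbol] + alpha) / (total + alpha * len(ALPHABET))
        profile.append(-math.log2(p))
        ctx_counts[symbol] += 1
    return profile


if __name__ == "__main__":
    bits = complexity_profile("ACACACACACACACACGTTGCA", k=2)
    print([round(b, 2) for b in bits])
```

In this sketch, regions the model predicts well (for example, repeats) cost few bits per base, while regions it predicts poorly cost close to or more than 2 bits, so plotting the returned values along the sequence gives a simple profile. In practice the raw values would usually be smoothed with a sliding window before interpretation.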