Significantly Lower Entropy Estimates for Natural DNA Sequences

Authors:
David Loewenstern;Peter N. Yianilos
Venue:
DCC '97 Proceedings of the Conference on Data Compression
Year:
1997

Citing 0
Cited 7

Computationally Inspired Biotechnologies: Improved DNA Synthesis and Associative Search Using Error-Correcting Codes and Vector-Quantization
DNA '00 Revised Papers from the 6th International Workshop on DNA-Based Computers: DNA Computing

Compression of Biological Sequences by Greedy Off-Line Textual Substitution
DCC '00 Proceedings of the Conference on Data Compression

DNA Sequence Compression Using the Burrows-Wheeler Transform
CSB '02 Proceedings of the IEEE Computer Society Conference on Bioinformatics

The SCP and Compressed Domain Analysis of Biological Sequences
CSB '03 Proceedings of the IEEE Computer Society Conference on Bioinformatics

An efficient normalized maximum likelihood algorithm for DNA sequence compression
ACM Transactions on Information Systems (TOIS)

Compression of Annotated Nucleotide Sequences
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

Iterative Dictionary Construction for Compression of Large DNA Data Sets
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

Abstract

If DNA were a random string over its alphabet {A,C,G,T}, an optimal code would assign 2 bits to each nucleotide. We imagine DNA to be a highly ordered, purposeful molecule, and might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly, this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). Our method in some cases increases this gap more than fivefold (to 1.66 bits) and may lead to better performance in microbiological pattern recognition applications. One of our main contributions, and the principal source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts whose predictions are then combined into a single prediction. The structure of this combination is novel, and its parameters are learned using expectation maximization (EM).
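The two quantities the abstract compares can be illustrated with a minimal Python sketch. It is not the authors' model: the function names are hypothetical, the experts are assumed to be given as fixed per-position probabilities, and the combination shown is a plain weighted mixture rather than the novel structure the abstract refers to. The first function computes the per-nucleotide entropy from single-nucleotide frequencies; the second uses EM to learn mixture weights over a panel of experts; the third reports the resulting average code length in bits.

```python
import math
from collections import Counter


def order0_entropy_bits(seq):
    """Per-symbol entropy in bits, using only single-symbol frequencies."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def em_mixture_weights(expert_probs, iterations=50):
    """Learn mixture weights for a fixed panel of experts with EM.

    expert_probs[t][k] is the probability that (hypothetical) expert k
    assigned to the symbol actually observed at position t. Assumes at
    least one expert gives a positive probability at every position.
    """
    k = len(expert_probs[0])
    weights = [1.0 / k] * k  # start from a uniform mixture
    for _ in range(iterations):
        totals = [0.0] * k
        for probs in expert_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            # E-step: responsibility of each expert for this position
            for j in range(k):
                totals[j] += weights[j] * probs[j] / mix
        # M-step: new weight is the average responsibility
        weights = [t / len(expert_probs) for t in totals]
    return weights


def mixture_cost_bits(expert_probs, weights):
    """Average code length (bits per symbol) under the weighted mixture."""
    total = 0.0
    for probs in expert_probs:
        mix = sum(w * p for w, p in zip(weights, probs))
        total -= math.log2(mix)
    return total / len(expert_probs)
```

For a sequence in which the four nucleotides are equally frequent, such as `"ACGT"`, `order0_entropy_bits` returns 2 bits, the random-string baseline mentioned above. Because each EM iteration cannot decrease the likelihood of the observed symbols, the learned weights never give a higher average code length than the uniform starting weights.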