A Block Coding Method that Leads to Significantly Lower Entropy Values for the Proteins and Coding Sections of Haemophilus influenzae

  • Authors:
  • G. Sampath

  • Affiliations:
  • -

  • Venue:
  • CSB '03 Proceedings of the IEEE Computer Society Conference on Bioinformatics
  • Year:
  • 2003

Abstract

A simple statistical block code in combination with the LZW-based compression utilities gzip and compress has been found to increase by a significant amount the level of compression possible for the proteins encoded in Haemophilus influenzae, the first fully sequenced genome. The method yields an entropy value of 3.665 bits per symbol (bps), which is 0.657 bps below the maximum of 4.322 bps and an improvement of 0.452 bps over the best known to date of 4.118 bps using Matsumoto, Sadakane, and Imai's lza-CTW algorithm. Calculations based on a compact inverse genetic code show that the genome has a maximum entropy of 1.757 bps for the coding regions, with a possibly lower actual entropy. These results hint at the existence of hitherto unexplored redundancies that do not show up in Markov models and are indicative of more internal structure than suspected in both the protein and the genome.
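The figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the author's block code: it assumes the maximum of 4.322 bps corresponds to a uniform distribution over the 20 standard amino acids (log2 20), and it uses Python's `zlib` (DEFLATE, the same family as gzip) as a stand-in compressor to show how a bits-per-symbol estimate is obtained from a compressed size. The names `max_entropy_bps` and `compressed_bps` are hypothetical helpers introduced here.

```python
import math
import zlib

# Assumption: the 20 standard amino acids, one ASCII letter per residue.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def max_entropy_bps(alphabet_size: int) -> float:
    """Entropy in bits per symbol of a uniform source over the alphabet."""
    return math.log2(alphabet_size)


def compressed_bps(sequence: str) -> float:
    """Upper-bound entropy estimate: compressed bits divided by symbol count.

    Uses zlib (DEFLATE) as a stand-in; this is not the block code
    described in the abstract.
    """
    raw = sequence.encode("ascii")
    packed = zlib.compress(raw, 9)
    return 8 * len(packed) / len(raw)


h_max = max_entropy_bps(len(AMINO_ACIDS))
print(round(h_max, 3))          # maximum entropy for 20 symbols
print(round(h_max - 3.665, 3))  # gap between the maximum and the reported value
print(round(4.118 - 3.665, 3))  # gap between the two rounded reported values
```

Differencing the rounded figures 4.118 and 3.665 gives 0.453 rather than the 0.452 stated in the abstract; a gap of that size is consistent with the improvement having been computed from unrounded values.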