A Block Coding Method that Leads to Significantly Lower Entropy Values for the Proteins and Coding Sections of Haemophilus influenzae

Authors:
G. Sampath
Affiliations:
-
Venue:
CSB '03 Proceedings of the IEEE Computer Society Conference on Bioinformatics
Year:
2003

Abstract

A simple statistical block code in combination with the LZW-based compression utilities gzip and compress has been found to increase by a significant amount the level of compression possible for the proteins encoded in Haemophilus influenzae, the first fully sequenced genome. The method yields an entropy value of 3.665 bits per symbol (bps), which is 0.657 bps below the maximum of 4.322 bps and an improvement of 0.452 bps over the best known to date of 4.118 bps using Matsumoto, Sadakane, and Imai's lza-CTW algorithm. Calculations based on a compact inverse genetic code show that the genome has a maximum entropy of 1.757 bps for the coding regions, with a possibly lower actual entropy. These results hint at the existence of hitherto unexplored redundancies that do not show up in Markov models and are indicative of more internal structure than suspected in both the protein and the genome.
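The 4.322 bps maximum quoted above is log2(20), the entropy of a uniform distribution over the 20 standard amino acids. The following Python sketch reproduces that arithmetic and shows one common way to obtain a compression-based upper bound on bits per symbol. It is illustrative only: it does not implement the paper's block code, which the abstract does not describe. The function names are hypothetical, and zlib (the DEFLATE library used by gzip) stands in for the command-line utilities.

```python
import math
import zlib

# The 20 standard amino-acid one-letter codes (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def max_entropy_bps(alphabet_size: int) -> float:
    """Entropy in bits per symbol of a uniform source over the alphabet."""
    return math.log2(alphabet_size)


def compressed_bps(sequence: str, level: int = 9) -> float:
    """Rough upper bound on entropy: compressed bits divided by symbol count.

    Hypothetical helper; uses zlib rather than the gzip/compress utilities.
    """
    if not sequence:
        raise ValueError("sequence must be non-empty")
    compressed = zlib.compress(sequence.encode("ascii"), level)
    return 8 * len(compressed) / len(sequence)


h_max = max_entropy_bps(len(AMINO_ACIDS))
reported = 3.665  # entropy value reported in the abstract, in bps

print(round(h_max, 3))             # 4.322
print(round(h_max - reported, 3))  # 0.657
```

Calling `compressed_bps` on a real protein sequence in one-letter code gives a figure that can be compared directly with the values in the abstract. For short inputs the fixed zlib header overhead inflates the estimate, so the measure is meaningful only for long sequences.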