A simple statistical block code in combination with the LZW-based compression utilities gzip and compress has been found to increase by a significant amount the level of compression possible for the proteins encoded in Haemophilus influenzae, the first fully sequenced genome. The method yields an entropy value of 3.665 bits per symbol (bps), which is 0.657 bps below the maximum of 4.322 bps and an improvement of 0.452 bps over the best known to date of 4.118 bps using Matsumoto, Sadakane, and Imai's lza-CTW algorithm. Calculations based on a compact inverse genetic code show that the genome has a maximum entropy of 1.757 bps for the coding regions, with a possibly lower actual entropy. These results hint at the existence of hitherto unexplored redundancies that do not show up in Markov models and are indicative of more internal structure than suspected in both the protein and the genome.
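The figures quoted above can be checked with a little arithmetic. The sketch below is only an illustration, not the authors' method: it assumes the 4.322 bps maximum is log2(20) for a 20-letter amino-acid alphabet, and the function names and the toy sequence are hypothetical. It also shows a plain zero-order (symbol frequency) entropy estimate, which is a far simpler model than the block code described in the abstract.

```python
import math
from collections import Counter


def max_entropy_bits(alphabet_size: int) -> float:
    """Upper bound on entropy per symbol for a uniform alphabet."""
    return math.log2(alphabet_size)


def zero_order_entropy(sequence: str) -> float:
    """Empirical entropy in bits per symbol from symbol frequencies only.

    This ignores all context between symbols; it is NOT the block code
    from the abstract, just a baseline for comparison.
    """
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Assumption: 20 standard amino acids, so the maximum is log2(20).
H_MAX = round(max_entropy_bits(20), 3)   # 4.322
H_REPORTED = 3.665                       # value quoted in the abstract

# Gap between the maximum and the reported entropy.
GAP_TO_MAX = round(H_MAX - H_REPORTED, 3)

# Toy sequence (hypothetical, not real proteome data).
TOY = "ACAC"
H_TOY = zero_order_entropy(TOY)

if __name__ == "__main__":
    print(f"maximum entropy for 20 symbols: {H_MAX} bps")
    print(f"gap to reported value: {GAP_TO_MAX} bps")
    print(f"zero-order entropy of {TOY!r}: {H_TOY} bps")
```

Running a zero-order estimate on a real proteome would give a value close to the maximum, which is why a result well below 4.322 bps suggests structure that simple frequency models miss.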