Compressing proteomes: the relevance of medium range correlations
EURASIP Journal on Bioinformatics and Systems Biology
Searching a pattern in compressed DNA sequences
International Journal of Bioinformatics Research and Applications
We consider the problem of compressing protein sequences. Having observed genome-scale long-range correlations in concatenated protein sequences from different organisms, we propose a method that exploits this unusual redundancy to compress the sequences. The result is a significant reduction in the number of bits required to represent them. We report 2.27, 2.55, 3.11, and 3.44 bits per symbol (bps) for protein sequences from M. jannaschii, H. influenzae, S. cerevisiae, and H. sapiens, respectively, the same protein sequences used by Nevill-Manning and Witten in the "Protein is incompressible" paper [23]. The observed long-range correlations could have significant implications beyond compression and complexity analysis of protein sequences.
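To put the bits-per-symbol figures in context, the following minimal sketch shows how bps can be computed from a compressed size and compared with the log2(20) ≈ 4.32 bits needed to code 20 amino acids uniformly. It is an illustration only: the helper names (`bits_per_symbol`, `relative_saving`) are hypothetical, the toy sequence is artificial, and `zlib` is used purely as a stand-in compressor, not the method proposed in the paper.

```python
import math
import zlib

# The 20 standard amino acids; coding them uniformly costs log2(20) bits each.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
UNIFORM_BASELINE = math.log2(len(AMINO_ACIDS))  # about 4.32 bits per symbol


def bits_per_symbol(sequence: str, compressed: bytes) -> float:
    """Compressed size in bits divided by the number of symbols."""
    return 8 * len(compressed) / len(sequence)


def relative_saving(bps: float) -> float:
    """Fraction of bits saved relative to the uniform 20-letter baseline."""
    return 1 - bps / UNIFORM_BASELINE


# Figures reported in the abstract (bits per symbol).
reported = {
    "M. jannaschii": 2.27,
    "H. influenzae": 2.55,
    "S. cerevisiae": 3.11,
    "H. sapiens": 3.44,
}

for organism, bps in reported.items():
    print(f"{organism}: {bps:.2f} bps, "
          f"{relative_saving(bps):.0%} below the uniform baseline")

# Toy measurement with a stand-in compressor (zlib, NOT the paper's method).
# The sequence is artificial and highly repetitive, so it compresses far
# better than real proteomes would.
toy_sequence = AMINO_ACIDS * 500
toy_bps = bits_per_symbol(toy_sequence, zlib.compress(toy_sequence.encode("ascii")))
print(f"toy sequence with zlib: {toy_bps:.3f} bps")
```

The comparison against log2(20) assumes a 20-letter alphabet with no ambiguity codes; real protein databases may contain additional symbols, which would shift the baseline slightly.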