Protein Is Incompressible

  • Authors:
  • Craig G. Nevill-Manning; Ian H. Witten

  • Venue:
  • DCC '99 Proceedings of the Conference on Data Compression
  • Year:
  • 1999

Abstract

Life is based on two polymers, DNA and protein, whose properties can be described in a simple text file. It is natural to expect that standard text compression techniques would work on biological sequences as they do on English text. But biological sequences have a fundamentally different structure from linguistic ones, and standard compression schemes exhibit disappointing performance on them. We describe a new approach to compression that takes account of the underlying biochemical principles. This gives rise to a generalization of blending for statistical compressors where every context is used, weighted by its similarity to the current context. Results support what research in bioinformatics has shown: that there is little Markov dependency in protein. This cripples data compression schemes and reduces them to order zero models.
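
The idea of blending over every context, weighted by similarity to the current one, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the add-alpha smoothing, and especially the toy `similarity` measure (fraction of matching positions) are assumptions made for illustration. A biochemically motivated weighting would presumably use a measure of amino acid similarity instead.

```python
from collections import Counter, defaultdict

# The 20 standard amino acids, one letter each.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def similarity(ctx_a, ctx_b):
    """Hypothetical toy similarity: fraction of positions where the contexts agree."""
    matches = sum(a == b for a, b in zip(ctx_a, ctx_b))
    return matches / len(ctx_a)


def build_context_counts(sequence, order):
    """Count which symbol follows each length-`order` context in `sequence`."""
    counts = defaultdict(Counter)
    for i in range(order, len(sequence)):
        counts[sequence[i - order:i]][sequence[i]] += 1
    return counts


def blended_distribution(counts, current_ctx, alpha=1.0):
    """Predict the next symbol by pooling counts from every seen context,
    each weighted by its similarity to `current_ctx`.

    `alpha` is an assumed add-alpha pseudo-count that keeps every
    probability nonzero.
    """
    totals = {s: alpha for s in ALPHABET}
    for ctx, followers in counts.items():
        weight = similarity(ctx, current_ctx)
        if weight == 0.0:
            continue  # dissimilar contexts contribute nothing
        for symbol, n in followers.items():
            totals[symbol] += weight * n
    norm = sum(totals.values())
    return {s: v / norm for s, v in totals.items()}


if __name__ == "__main__":
    counts = build_context_counts("ACDACDACE", order=2)
    probs = blended_distribution(counts, "AD")
    for symbol in sorted(probs, key=probs.get, reverse=True)[:3]:
        print(symbol, round(probs[symbol], 3))
```

If `similarity` returned 1 only for an identical context and 0 otherwise, the sketch would collapse to an ordinary fixed-order context model; allowing partial weights is what lets contexts that were never seen exactly still inform the prediction.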