DNA sequence compression using the normalized maximum likelihood model for discrete regression

  • Authors:
  • Ioan Tabus; Gergely Korodi; Jorma Rissanen

  • Venue:
  • DCC '03 Proceedings of the Conference on Data Compression
  • Year:
  • 2003

Abstract

We discuss how to use the normalized maximum likelihood (NML) model for encoding sequences known to have regularities in the form of approximate repetitions. We present a particular version of the NML model for discrete regression, which is shown to provide a very powerful yet simple model for encoding the approximate repeats in DNA sequences. Combining the model of repeats with a simple first-order Markov model, we obtain a fast lossless compression method, which compares favorably with the existing DNA compression programs. It is remarkable that a simple model, which recursively updates a small number of parameters, is able to reach the state-of-the-art compression ratio for DNA sequences obtained with much more complex models. Being a minimum description length (MDL) model, the NML model may later prove to be useful in studying global and local features of DNA or possibly of other biological sequences.
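
As a rough illustration of the first-order Markov component mentioned in the abstract, the sketch below estimates the ideal adaptive code length, in bits, of a DNA string when each base is predicted from the one before it. It is not the authors' implementation and does not cover the NML model of approximate repeats. The function name `markov1_code_length`, the add-one initialization of the counts, and the flat 2-bit cost for the first base are all assumptions made for this example.

```python
import math

ALPHABET = "ACGT"

def markov1_code_length(seq: str) -> float:
    """Ideal adaptive code length (bits) under a first-order Markov model.

    Assumptions for illustration only: counts start at 1 (add-one
    smoothing) and the first base costs a flat 2 bits.
    """
    # counts[prev][cur] = smoothed count of transitions prev -> cur
    counts = {a: {b: 1 for b in ALPHABET} for a in ALPHABET}
    bits = 2.0 if seq else 0.0
    for prev, cur in zip(seq, seq[1:]):
        row = counts[prev]
        total = sum(row.values())
        # cost of coding `cur` with the current adaptive estimate
        bits += -math.log2(row[cur] / total)
        # update the estimate only after coding, so a decoder can mirror it
        row[cur] += 1
    return bits

if __name__ == "__main__":
    repetitive = "ACGT" * 50
    print(markov1_code_length(repetitive) / len(repetitive), "bits per base")
```

On a highly regular input such as the repeated `ACGT` string above, the per-base cost falls well below the 2 bits needed by a fixed-length code, since the counts are updated recursively as the sequence is read.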