An efficient normalized maximum likelihood algorithm for DNA sequence compression

Authors:
Gergely Korodi;Ioan Tabus
Affiliations:
Tampere University of Technology, Tampere, Finland;Tampere University of Technology, Tampere, Finland
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2005

Abstract

This article presents an efficient algorithm for DNA sequence compression that achieves the best compression ratios reported on a test set commonly used for evaluating DNA compression programs. The algorithm introduces many refinements to a compression method that combines three encoding options: (1) encoding by a simple normalized maximum likelihood (NML) model for discrete regression, through reference to preceding approximately matching blocks; (2) encoding by a first-order context coder; and (3) representing strings in the clear. Together these make efficient use of the redundancy sources in DNA data while keeping execution times short. One of the main algorithmic features is the constraint that matching blocks include reasonably long contiguous matches, which not only significantly reduces the search time but can also be exploited to modify the NML model and obtain shorter code lengths. The algorithm adapts to the changing statistics of DNA data, and by predictively encoding the matching pointers it compresses long approximate matches successfully. In addition to a comparison with previous DNA encoding methods, we present compression results for the recently published human genome data.
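
As a rough illustration of option (1), the sketch below estimates how many bits an NML code might need for one block, given an equally long earlier block used as the reference. It is a simplified reading under stated assumptions, not the article's implementation: each position is assumed either to match the reference or to be one of the three other bases, the match/mismatch mask is coded with a Bernoulli NML code, and every mismatch costs a further log2(3) bits. The function name `nml_block_code_length` is hypothetical, and the cost of the pointer to the reference block, the first-order context coder, and the contiguous-match constraint are all left out.

```python
import math


def nml_block_code_length(block: str, reference: str) -> float:
    """Approximate bits needed to code `block` given `reference` (assumed model)."""
    if not block or len(block) != len(reference):
        raise ValueError("block and reference must be non-empty and equally long")

    n = len(block)
    # Number of positions where the block agrees with the reference.
    k = sum(a == b for a, b in zip(block, reference))

    def max_likelihood(m: int) -> float:
        # Maximized Bernoulli likelihood of one particular mask with m matches.
        p = m / n
        return (p ** m) * ((1.0 - p) ** (n - m))

    # NML normalizer: sum of maximized likelihoods over all possible masks.
    # Direct summation is only suitable for short blocks (assumption).
    normalizer = sum(math.comb(n, m) * max_likelihood(m) for m in range(n + 1))

    mask_bits = math.log2(normalizer) - math.log2(max_likelihood(k))
    # Each mismatch must also say which of the three other bases occurred.
    mismatch_bits = (n - k) * math.log2(3)
    return mask_bits + mismatch_bits


if __name__ == "__main__":
    clear_bits = 2 * 8  # option (3): two bits per base for an 8-base block
    print(nml_block_code_length("ACGTACGT", "ACGTACGA"), "vs", clear_bits)
```

Under this model a block that closely matches its reference costs far less than two bits per base, while a block with no matching positions costs slightly more, which is consistent with keeping the "in the clear" representation available as a fallback.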