An optimal DNA segmentation based on the MDL principle

Authors:
Wojciech Szpankowski;Wenhui Ren;Lukasz Szpankowski
Affiliations:
Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA.;Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA.;Cell and Molecular Biology, University of Michigan, Ann Arbor, MI 48104, USA
Venue:
International Journal of Bioinformatics Research and Applications
Year:
2005

Abstract

The biological world is highly stochastic and inhomogeneous in its behaviour. DNA contains regions with a high concentration of G or C bases, stretches with an abundance of CG dinucleotides (CpG islands), coding regions with a strong periodicity-of-three pattern, and so forth. The transitions between these regions, also known as change points, carry important biological information. Computational methods that identify such homogeneous regions are called segmentation methods. Viewing a DNA sequence as a non-stationary process, we apply recent techniques of universal source coding to discover stationary (possibly recurrent) segments. In particular, we adopt the Stein-Ziv lemma to find an asymptotically optimal discriminant function that determines whether two DNA segments are generated by the same source while keeping the probability of a false positive exponentially small. Next, we use the Minimum Description Length (MDL) principle to select parameters, which leads to a linear-time segmentation algorithm. We apply our algorithm to human chromosomes 9 and 20 to discover coding and noncoding regions, the starting positions of genes, and the beginnings of CpG islands.
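To make the idea of testing whether two segments share a source more concrete, here is a minimal Python sketch. It is not the paper's discriminant function and does not reproduce its MDL parameter selection: as a stand-in it compares the empirical base frequencies of adjacent fixed-size windows using the Jensen-Shannon divergence. The names `same_source` and `change_points`, the memoryless (single-base) model, the window size, and the threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def empirical_distribution(segment):
    """Relative frequency of each base in a segment (memoryless model)."""
    if not segment:
        raise ValueError("segment must be non-empty")
    counts = Counter(segment)
    return [counts[base] / len(segment) for base in ALPHABET]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (symmetric, bounded by 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def same_source(seg_a, seg_b, threshold=0.05):
    """Hypothetical decision rule: same source if the divergence is below a threshold."""
    p = empirical_distribution(seg_a)
    q = empirical_distribution(seg_b)
    return js_divergence(p, q) < threshold


def change_points(sequence, window=50, threshold=0.05):
    """Scan adjacent windows once and report boundaries where the source appears to change."""
    points = []
    for i in range(window, len(sequence) - window + 1, window):
        left = sequence[i - window:i]
        right = sequence[i:i + window]
        if not same_source(left, right, threshold):
            points.append(i)
    return points


if __name__ == "__main__":
    # Toy sequence: an AT-rich half followed by a GC-rich half.
    toy = "AT" * 100 + "GC" * 100
    print(change_points(toy))
```

On the toy sequence the only boundary reported is position 200, where the composition switches from AT-rich to GC-rich. The scan visits each window a constant number of times, so it runs in time linear in the sequence length; in this sketch the window size and threshold are fixed by hand, whereas the abstract describes selecting parameters via the MDL principle.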