The biological world is highly stochastic and inhomogeneous in its behaviour. DNA contains regions with a high concentration of G or C bases, stretches with an abundance of CG dinucleotides (CpG islands), coding regions with a strong periodicity-of-three pattern, and so forth. Transitions between these regions, also known as change points, carry important biological information. Computational methods that identify such homogeneous regions are called segmentations. Viewing a DNA sequence as a non-stationary process, we apply recent techniques of universal source coding to discover stationary (possibly recurrent) segments. In particular, we adopt the Stein-Ziv lemma to find an asymptotically optimal discriminant function that determines whether two DNA segments are generated by the same source while ensuring an exponentially small false-positive probability. Next, we use the Minimum Description Length (MDL) principle to select parameters that lead to a linear-time segmentation algorithm. We apply our algorithm to human chromosomes 9 and 20 to discover coding and noncoding regions, the starting positions of genes, and the beginnings of CpG islands.
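To make the idea of testing whether two segments share a source more concrete, the following is a minimal sketch of a generic two-part MDL comparison. It is not the discriminant function derived in the paper: it assumes a memoryless (i.i.d.) model over the four bases, and the names `description_length` and `same_source` are hypothetical, chosen only for illustration. Two segments are judged to come from one source when coding them with a single model costs no more bits than coding them with two separate models.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: sequences use only these four bases


def empirical_entropy_bits(segment: str) -> float:
    """Bits needed to code the segment under its own memoryless model (n * H)."""
    n = len(segment)
    counts = Counter(segment)
    return -sum(c * math.log2(c / n) for c in counts.values())


def description_length(segment: str) -> float:
    """Two-part MDL cost: data bits plus (k - 1) / 2 * log2(n) parameter bits."""
    n = len(segment)
    if n == 0:
        return 0.0
    k = len(ALPHABET)
    return empirical_entropy_bits(segment) + 0.5 * (k - 1) * math.log2(n)


def same_source(x: str, y: str) -> bool:
    """True if one shared model codes x and y at least as cheaply as two models."""
    joint = description_length(x + y)
    separate = description_length(x) + description_length(y)
    return joint <= separate


if __name__ == "__main__":
    # Illustrative synthetic inputs, not real chromosome data.
    print(same_source("ACGT" * 100, "ACGT" * 100))
    print(same_source("AT" * 200, "GC" * 200))
```

In this sketch the parameter penalty is what prevents a split from always looking cheaper: separate models fit each segment at least as well, so a change point is accepted only when the saving in data bits exceeds the cost of describing a second set of parameters.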