Statistical Identification of Uniformly Mutated Segments within Repeats

Authors:
Süleyman Cenk Sahinalp;Evan E. Eichler;Paul W. Goldberg;Petra Berenbrink;Tom Friedetzky;Funda Ergün
Venue:
CPM '02 Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching
Year:
2002

Abstract

Given a long string of characters from a constant-size (w.l.o.g. binary) alphabet, we present an algorithm to determine whether its characters have been generated by a single i.i.d. random source. More specifically, consider all possible k-coin models for generating a binary string S, where each bit of S is generated via an independent toss of one of the k coins in the model. The choice of which coin to toss is decided by a random walk on the set of coins, where the probability of a coin change is much lower than the probability of using the same coin repeatedly. We present a statistical test procedure which, for any given S, determines whether the a posteriori probability for k = 1 is higher than for any other k > 1. Our algorithm runs in time O(l^4 log l), where l is the length of S, through a dynamic programming approach which exploits the convexity of the a posteriori probability for k.
The problem we consider arises from two critical applications in analyzing long alignments between pairs of genomic sequences. A high alignment score between two DNA sequences usually indicates an evolutionary relationship, i.e. that the sequences have been generated as a result of one or more copy events followed by random point mutations. Such sequences may include functional regions (e.g. exons) as well as non-functional ones (e.g. introns). Functional regions with critical importance exhibit much lower mutation rates than non-functional DNA (or DNA with non-critical functionality) due to selective pressures for conserving such regions. As a result, given an alignment between two highly similar genome sequences, it may be possible to distinguish functional regions from non-functional ones using variations in the mutation rate. Our test provides a means for determining variations in the mutation rate and thus checking the existence of DNA regions of varying degrees of functionality.
A second application for our test is in determining whether two highly similar, thus evolutionarily related, genome segments are the result of a single copy event or of a complex series of copies. This is particularly an issue in evolutionary studies of genome regions rich with repeat segments (especially non-functional tandemly repeated DNA). Our approach can be used to distinguish simple copies from complex repeats, again by exploiting variations in mutation rates.
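
The k-coin model described in the abstract can be illustrated with a small simulation. The following Python sketch is not the paper's O(l^4 log l) test. It only samples a binary string from a k-coin source and then measures how much the maximum log-likelihood improves when the string is split once into two segments, each with its own coin. The function names, the `switch_prob` parameter, and the rule of moving to a uniformly chosen different coin are assumptions made for illustration. The raw gain is never negative, so on its own it cannot decide between k = 1 and k > 1; a real test would have to weigh it against the low prior probability of a coin change.

```python
import math
import random


def sample_k_coin_string(biases, switch_prob, length, seed=0):
    """Sample bits from a k-coin model (illustrative assumption, not the paper's code).

    biases[i] is the probability that coin i yields a 1. Before each toss the
    source moves to a uniformly chosen different coin with probability
    switch_prob, which is meant to be small.
    """
    rng = random.Random(seed)
    k = len(biases)
    coin = rng.randrange(k)
    bits, coins = [], []
    for _ in range(length):
        if k > 1 and rng.random() < switch_prob:
            coin = rng.choice([c for c in range(k) if c != coin])
        bits.append(1 if rng.random() < biases[coin] else 0)
        coins.append(coin)
    return bits, coins


def bernoulli_loglik(ones, n):
    """Maximum log-likelihood of n tosses containing `ones` 1s under a single coin."""
    if n == 0 or ones == 0 or ones == n:
        return 0.0
    p = ones / n
    return ones * math.log(p) + (n - ones) * math.log(1 - p)


def best_single_split_gain(bits):
    """Return (gain, position) for the best split into two single-coin segments.

    gain is the increase in maximum log-likelihood over the one-coin model;
    position is None when no split improves on the one-coin model.
    """
    n = len(bits)
    total = sum(bits)
    base = bernoulli_loglik(total, n)
    best_gain, best_pos = 0.0, None
    left_ones = 0
    for i in range(1, n):
        left_ones += bits[i - 1]
        split = bernoulli_loglik(left_ones, i) + bernoulli_loglik(total - left_ones, n - i)
        gain = split - base
        if gain > best_gain:
            best_gain, best_pos = gain, i
    return best_gain, best_pos


if __name__ == "__main__":
    # Two coins with clearly different biases and rare switches.
    bits, coins = sample_k_coin_string([0.1, 0.6], switch_prob=0.01, length=500, seed=1)
    print(best_single_split_gain(bits))
```

In the alignment setting of the abstract, a 1 could stand for a mismatched column and a 0 for a matched one, so that a change of coin corresponds to a change in the mutation rate along the alignment.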