Common substrings in random strings

Authors:
Eric Blais;Mathieu Blanchette
Affiliations:
McGill Centre for Bioinformatics and School of Computer Science, McGill University, Montréal, Québec, Canada;McGill Centre for Bioinformatics and School of Computer Science, McGill University, Montréal, Québec, Canada
Venue:
CPM'06 Proceedings of the 17th Annual conference on Combinatorial Pattern Matching
Year:
2006

Abstract

In computational biology, an important problem is to identify a word of length k that is present in each of a given set of sequences. Here, we investigate the problem of calculating the probability that such a word exists in a set of r random strings. Existing methods to approximate this probability are either inaccurate when r > 2 or are restricted to Bernoulli models. We introduce two new methods for computing this probability under Bernoulli and Markov models. We present generalizations of the methods to compute the probability of finding a word of length k shared among q of r sequences, and to allow mismatches. We show through simulations that our approximations are significantly more accurate than previously published methods.
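
To make the quantity in the abstract concrete, the following is a minimal Monte Carlo sketch that estimates the probability that some word of length k occurs in every one of r random strings. It is not either of the paper's methods. It is a brute-force simulation under a Bernoulli (independent letters) model. The string length n, the DNA alphabet, the letter weights, the trial count, and all function names are assumptions introduced here for illustration only.

```python
import random


def kmers(s, k):
    """Return the set of all length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def has_common_word(strings, k):
    """Return True if some word of length k occurs in every string."""
    common = kmers(strings[0], k)
    for s in strings[1:]:
        common &= kmers(s, k)
        if not common:
            return False
    return bool(common)


def estimate_probability(r, n, k, alphabet="ACGT", weights=None,
                         trials=2000, seed=0):
    """Monte Carlo estimate of P(a length-k word is shared by all r strings).

    Each string has length n (an assumed parameter), with letters drawn
    independently from `alphabet` using `weights` (uniform if None),
    i.e. a Bernoulli model. This is a simulation baseline only.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        strings = ["".join(rng.choices(alphabet, weights=weights, k=n))
                   for _ in range(r)]
        if has_common_word(strings, k):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Hypothetical parameters: 5 strings of length 100, words of length 6.
    print(estimate_probability(r=5, n=100, k=6))
```

A simulation of this kind is only practical when the probability is not too small, since rare events need many trials to estimate reliably. That limitation is one reason analytic approximations such as those studied in the paper are useful.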