Large deviation properties for patterns

Authors:
Jérémie Bourdon;Mireille Régnier
Affiliations:
LINA, CNRS UMR 6241, Université de Nantes, France and DYLISS-Inria team, Inria Rennes-Bretagne-Atlantique, France;AMIB-Inria team, LIX-Ecole Polytechnique, 91128 Palaiseau, France
Venue:
Journal of Discrete Algorithms
Year:
2014

Abstract

Deciding whether a given pattern is over- or under-represented with respect to a given background model is a key question in computational biology. Such a decision is usually made by computing p-values that reflect the "exceptionality" of a pattern in a given sequence or set of sequences. In the simplest cases (short and simple patterns, a simple background model, a small number of sequences), an exact p-value can be computed at tractable cost. Realistic cases are in general too complicated to yield an exact p-value, so approximations are proposed instead (Gaussian, Poisson, and large deviation approximations). These approximations apply under specific conditions: Gaussian approximations are valid in the central domain, while Poisson and large deviation approximations are valid for rare events. In the present paper, we prove a large deviation approximation for the double-strand counting problem, that is, counting the occurrences of a given pattern in a set of sequences that arise from both strands of the genome. In that case, dependencies between a sequence and its reverse complement cannot be neglected. They are captured here, for a Bernoulli model, from general combinatorial properties of the pattern. A large deviation result is also provided for a set of small sequences.
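
As a rough illustration of why the choice of approximation depends on the domain, the sketch below compares the three approximations named in the abstract on a deliberately simplified toy model. It assumes the occurrence count N follows a Binomial(n, p) law, which ignores pattern overlaps and the strand dependencies the paper addresses. The function names, the parameters n, p and k, and the use of the Chernoff bound exp(-n I(k/n)) as the large deviation estimate are assumptions made for this example, not the paper's method.

```python
import math


def exact_tail(n, p, k):
    """P(N >= k) for N ~ Binomial(n, p), summed term by term."""
    return sum(
        math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1)
    )


def gaussian_tail(n, p, k):
    """Normal approximation to P(N >= k), with a continuity correction."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    z = (k - 0.5 - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))


def poisson_tail(n, p, k):
    """Poisson approximation to P(N >= k), with rate lambda = n * p."""
    lam = n * p
    below = sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(k))
    return 1.0 - below


def large_deviation_bound(n, p, k):
    """Chernoff bound exp(-n * I(a)) on P(N >= k), where a = k / n and
    I(a) = a*log(a/p) + (1-a)*log((1-a)/(1-p)) is the binomial rate function."""
    a = k / n
    if a <= p:
        return 1.0  # the bound is only informative above the mean
    if a >= 1.0:
        return p**n  # only the outcome N = n remains
    rate = a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))
    return math.exp(-n * rate)


if __name__ == "__main__":
    # Toy setting (hypothetical values): a 4-letter word under a uniform
    # DNA model, 1000 candidate positions, 15 observed occurrences.
    n, p, k = 1000, 4.0**-4, 15
    print("exact          ", exact_tail(n, p, k))
    print("Gaussian       ", gaussian_tail(n, p, k))
    print("Poisson        ", poisson_tail(n, p, k))
    print("large deviation", large_deviation_bound(n, p, k))
```

In this toy setting the observed count lies far above the mean n p, so the event is rare: the Gaussian value falls well below the exact tail, the Poisson value stays close to it, and the large deviation expression gives a valid upper bound with the correct exponential decay rate.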