Assessing the Statistical Significance of Overrepresented Oligonucleotides

Authors:
Alain Denise;Mireille Régnier;Mathias Vandenbogaert
Venue:
WABI '01 Proceedings of the First International Workshop on Algorithms in Bioinformatics
Year:
2001

Citing 3

Extracting structured motifs using a suffix tree—algorithms and application to promoter consensus identification
RECOMB '00 Proceedings of the fourth annual international conference on Computational molecular biology

Finding motifs using random projections
RECOMB '01 Proceedings of the fifth annual international conference on Computational biology

An Exact Method for Finding Short Motifs in Sequences, with Application to the Ribosome Binding Site Problem
Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology

Cited 3

Most significant substring mining based on chi-square measure
PAKDD'10 Proceedings of the 14th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining - Volume Part I

Mining statistically significant substrings using the chi-square statistic
Proceedings of the VLDB Endowment

Sparse approaches for the exact distribution of patterns in long state sequences generated by a Markov source
Theoretical Computer Science

Abstract

Assessing the statistical significance of over-represented exceptional words is becoming an important task in computational biology. We show on two problems how large deviation methodology applies. First, when some oligomer H occurs more often than expected, i.e. may be overrepresented, large deviations allow for a very efficient computation of the so-called p-value. The second problem we address is the possible change in the oligomer distribution induced by the over-representation of some pattern. Discarding this noise allows for the detection of weaker signals. Related algorithmic and complexity issues are discussed and compared to previous results. The approach is illustrated with three typical examples of applications on biological data.
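
To give a rough feel for the first problem, the sketch below bounds the p-value of an observed word count with a textbook Chernoff (large deviation) bound on a binomial upper tail. It is not the authors' method: it assumes an i.i.d. letter model and treats the word positions as independent trials, which ignores the dependence between overlapping occurrences. The function names and the uniform letter frequencies are hypothetical choices made for illustration.

```python
import math


def count_occurrences(seq, word):
    """Count occurrences of word in seq, overlapping ones included."""
    last_start = len(seq) - len(word)
    return sum(1 for i in range(last_start + 1) if seq.startswith(word, i))


def word_probability(word, freqs):
    """Probability of word at a fixed position under an i.i.d. letter model."""
    p = 1.0
    for letter in word:
        p *= freqs[letter]
    return p


def chernoff_upper_tail(k, n, p):
    """Upper bound on P(X >= k) for X ~ Binomial(n, p).

    Uses the bound exp(-n * KL(k/n || p)), valid when k/n > p.
    Assumption: independent trials, so overlap correlations are ignored.
    """
    if n <= 0 or not 0.0 < p < 1.0:
        raise ValueError("need n > 0 and 0 < p < 1")
    if k > n:
        return 0.0  # more successes than trials is impossible
    a = k / n
    if a <= p:
        return 1.0  # no over-representation, the bound is trivial
    if a == 1.0:
        return p ** n  # limit of the general formula as a -> 1
    kl = a * math.log(a / p) + (1.0 - a) * math.log((1.0 - a) / (1.0 - p))
    return math.exp(-n * kl)


# Hypothetical toy input: uniform letter frequencies, short sequence.
freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
seq = "ACGTACGTACGT"
word = "ACGT"

observed = count_occurrences(seq, word)
positions = len(seq) - len(word) + 1
p_word = word_probability(word, freqs)
bound = chernoff_upper_tail(observed, positions, p_word)
print(observed, positions, bound)
```

The bound costs a constant number of arithmetic operations once the count is known, which illustrates why large deviation estimates are attractive when many words have to be scored. A faithful treatment would also have to account for the overlap structure of the word.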