Assessing the significance of sets of words

Authors:
Valentina Boeva;Julien Clément;Mireille Régnier;Mathias Vandenbogaert
Affiliations:
Moscow State University, Vorob'evy Gory, Russia;IGM, Université de Marne-la-Vallée, France;Inria, Le Chesnay, France;Biozentrum, Universität Basel, Switzerland
Venue:
CPM'05 Proceedings of the 16th annual conference on Combinatorial Pattern Matching
Year:
2005

Abstract

Various criteria have been defined to evaluate the significance of sets of words, but they are often difficult to compute. We provide explicit expressions for the waiting time in this context. To assess the significance of a cluster of potential binding sites, we extend these expressions to the co-occurrence problem. We point out that the values of these criteria depend on a few fundamental parameters, and we provide efficient algorithms to compute them that rely on a combinatorial interpretation of the formulae. We show that our results are very tight in the so-called twilight zone and improve on previous rough approximations. The text is assumed to be generated by a stationary Markov process. These results are developed for an extended model of consensus.
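As a rough illustration of the waiting-time notion mentioned in the abstract, the sketch below computes the expected position of the first occurrence of a single word using the classical autocorrelation (overlap) formula. It is not the authors' algorithm: it assumes a single word rather than a set, and an i.i.d. (Bernoulli) text rather than the Markov model of the paper. The function name `expected_waiting_time` and the `probs` argument are hypothetical names chosen for this example.

```python
from fractions import Fraction


def expected_waiting_time(word, probs):
    """Expected index (1-based) at which `word` first ends in an i.i.d. text.

    Assumption: letters are drawn independently with probabilities `probs`
    (a dict letter -> probability). This is a simplification of the Markov
    setting described in the abstract.

    Classical formula: sum, over every length k such that the prefix of
    length k equals the suffix of length k, of 1 / P(prefix of length k).
    """
    m = len(word)
    total = Fraction(0)
    for k in range(1, m + 1):
        # k belongs to the autocorrelation set when prefix == suffix.
        if word[:k] == word[m - k:]:
            p = Fraction(1)
            for ch in word[:k]:
                p *= Fraction(probs[ch])
            total += 1 / p
    return total


# Uniform DNA alphabet, used only for illustration.
uniform = {c: Fraction(1, 4) for c in "ACGT"}
print(expected_waiting_time("ACGT", uniform))  # no self-overlap
print(expected_waiting_time("AAAA", uniform))  # maximal self-overlap
```

Under this simplified model, a self-overlapping word such as AAAA has a longer expected waiting time than a word of the same length and probability without overlaps, which is one reason overlap structure matters when judging significance.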