Motif discovery by monotone scores

Authors:
Alberto Apostolico;Cinzia Pizzi
Affiliations:
Dipartimento di Ingegneria dell'Informazione, Università di Padova, Via Gradenigo 6/A, I-35131 Padova, Italy and College of Computing, Georgia Institute of Technology, 801 Atlantic Drive, At ...;Dipartimento di Ingegneria dell'Informazione, Università di Padova, Via Gradenigo 6/A, I-35131 Padova, Italy and Department of Computer Science, University of Helsinki, P.O. Box 68 (Gustaf H ...
Venue:
Discrete Applied Mathematics
Year:
2007

Abstract

The detection of frequent patterns such as motifs and higher aggregates is of paramount interest in biology and pervades many other applications of automated discovery. The problem and its variants are usually plagued by a heavy computational burden. A related difficulty is that, owing to the sheer number of candidates, the tables and indices at the outset tend to be bulky, unmanageable, and ultimately uninformative. For solid patterns, it is possible to compact statistical indices by resorting to certain monotonicities exhibited by popular scores. The savings come from the fact that these monotonicities enable one to partition the candidate over-represented words into families in such a way that it suffices to consider and weigh only one candidate per family. In this paper, we study the problem of extracting, from a given source x and error threshold k, substrings of x that occur unusually often in x within k substitutions or mismatches. Specifically, we assume that the input textstring x of n characters is produced by an i.i.d. source, and design efficient methods for computing the probability and expected number of occurrences for substrings of x with (either exactly or up to) k mismatches. Two related schemes are presented. In the first one, an O(nk) time preprocessing of x is developed that supports the following subsequent query: for any substring w of x arbitrarily specified as input, the probability of occurrence of w in x within (either exactly or up to) k mismatches is reported in O(k^2) time. In the second scheme, a length or length range is arbitrarily specified, and the above probabilities are computed for all substrings of x having length in that range, in overall O(nk) time. Further, monotonicity conditions are introduced and studied for the probability and expected frequency of a substring under extension, increased number of errors, or both. Over intervals of constant frequency count, these monotonicities translate to some of the scores in use, thereby reducing the size of tables at the outset and enhancing the process of discovery. These latter derivations extend to patterns with mismatches an analysis previously devoted to exact patterns.
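
To make the quantities in the abstract concrete, the following is a minimal sketch of how the probability of a word w occurring with exactly (or up to) k mismatches under an i.i.d. source can be computed by a straightforward dynamic program, and how the expected number of occurrences in a text of length n follows by linearity of expectation. This is a naive O(|w| k) computation per word, not the O(nk) preprocessing or the O(k^2) query scheme of the paper. The function names, the `probs` dictionary of symbol probabilities, and the uniform DNA alphabet used in the example are all assumptions made for illustration.

```python
from typing import Dict, List


def mismatch_probabilities(w: str, probs: Dict[str, float], k: int) -> List[float]:
    """Return [P(0 mismatches), ..., P(k mismatches)], each for EXACTLY j mismatches,
    when a random string of length len(w) drawn from the i.i.d. source `probs`
    is aligned against w. (Hypothetical helper, not taken from the paper.)"""
    # dist[j] = probability that the positions processed so far show exactly j mismatches
    dist = [1.0] + [0.0] * k
    for ch in w:
        p_match = probs[ch]  # chance that a random symbol equals w's symbol here
        new = [0.0] * (k + 1)
        for j in range(k + 1):
            new[j] = dist[j] * p_match                    # this position matches
            if j > 0:
                new[j] += dist[j - 1] * (1.0 - p_match)   # this position mismatches
        dist = new
    return dist


def expected_occurrences(w: str, probs: Dict[str, float], k: int, n: int,
                         exactly: bool = False) -> float:
    """Expected number of positions of a random text of length n at which w
    occurs with exactly k (exactly=True) or up to k (exactly=False) mismatches."""
    dist = mismatch_probabilities(w, probs, k)
    p = dist[k] if exactly else sum(dist)
    positions = max(n - len(w) + 1, 0)  # number of candidate starting positions
    return positions * p                # linearity of expectation


# Example with an assumed uniform DNA source.
uniform_dna = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
dist = mismatch_probabilities("ACGT", uniform_dna, 1)
print(dist)                                             # exact-0 and exact-1 probabilities
print(expected_occurrences("ACGT", uniform_dna, 1, 103))
```

With such a routine one can also check numerically the kind of monotonicity the abstract refers to: the probability of occurrence with up to k mismatches cannot decrease when k grows, and cannot increase when w is extended by one symbol.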