Determining whether a pattern is statistically overrepresented or underrepresented in a string is a fundamental primitive in computational biology and in large-scale text mining. We study ways to speed up the computation of the expectation and variance of the number of occurrences of a pattern with rigid gaps in a random string. Our contributions are twofold: first, we focus on patterns in which groups of characters from an alphabet Σ can occur at each position. We describe a way to compute the exact expectation and variance of the number of occurrences of a pattern w in a random string generated by a Markov chain in O(|w|^2) time, improving a previous result that required O(2^|w|) time. We then consider the problem of computing expectation and variance of the motifs of a string s in an iid text. Motifs are rigid gapped patterns that occur at least twice in s, and in which at most one character from Σ occurs at each position. We study the case in which s is given offline, and an arbitrary motif w of s is queried online. We relate computational complexity to the structure of w and s, identifying sets of motifs that are amenable to o(|w| log |w|) time online computation after O(|s|^3) preprocessing of s. Our algorithms lend themselves to efficient implementations.
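To make the quantity under study concrete, the following minimal sketch computes the expected number of occurrences of a rigid gapped pattern in an iid text directly from the definition, using linearity of expectation. It is not the algorithm of the paper and does not cover the variance or the Markov chain model. The function name `expected_occurrences`, the use of `.` as the gap symbol, and the uniform DNA distribution are assumptions made only for illustration.

```python
from math import prod


def expected_occurrences(pattern, n, probs, wildcard="."):
    """Expected number of occurrences of `pattern` in an iid text of length n.

    `pattern` is a string over the alphabet plus a gap symbol (`wildcard`,
    a hypothetical convention chosen for this sketch); `probs` maps each
    alphabet character to its probability in the iid model.
    """
    m = len(pattern)
    if n < m:
        return 0.0
    # Probability that the pattern matches at one fixed start position:
    # a gap matches any character (probability 1), a solid character
    # matches with its own probability, and positions are independent.
    p_match = prod(probs[c] for c in pattern if c != wildcard)
    # Linearity of expectation over the n - m + 1 possible start positions.
    return (n - m + 1) * p_match


# Illustrative uniform distribution over a DNA alphabet (an assumption).
uniform_dna = {c: 0.25 for c in "ACGT"}
print(expected_occurrences("A.C", 10, uniform_dna))  # prints 0.5
```

This direct computation takes time linear in |w| per query. The variance is harder because occurrences at overlapping positions are not independent, which is where the structure of w and s comes into play.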