Faster variance computation for patterns with gaps

Authors:
Fabio Cunial
Affiliations:
College of Computing, Georgia Institute of Technology, Atlanta, GA
Venue:
MedAlg'12 Proceedings of the First Mediterranean conference on Design and Analysis of Algorithms
Year:
2012

Abstract

Determining whether a pattern is statistically overrepresented or underrepresented in a string is a fundamental primitive in computational biology and in large-scale text mining. We study ways to speed up the computation of the expectation and variance of the number of occurrences of a pattern with rigid gaps in a random string. Our contributions are twofold: first, we focus on patterns in which groups of characters from an alphabet Σ can occur at each position. We describe a way to compute the exact expectation and variance of the number of occurrences of a pattern w in a random string generated by a Markov chain in O(|w|^2) time, improving a previous result that required O(2^|w|) time. We then consider the problem of computing expectation and variance of the motifs of a string s in an iid text. Motifs are rigid gapped patterns that occur at least twice in s, and in which at most one character from Σ occurs at each position. We study the case in which s is given offline, and an arbitrary motif w of s is queried online. We relate computational complexity to the structure of w and s, identifying sets of motifs that are amenable to o(|w| log |w|) time online computation after O(|s|^3) preprocessing of s. Our algorithms lend themselves to efficient implementations.
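To make the quantities in the abstract concrete, the following is a minimal sketch of the textbook direct computation of the expectation and variance of the number of occurrences of a rigid gapped pattern in an iid text. It is not the algorithm of the paper: it takes quadratic time in |w|, does not cover the Markov chain case or groups of characters, and uses no preprocessing of s. The don't-care symbol `.`, the function names, and the `probs` dictionary of character probabilities are all assumptions made for illustration.

```python
from math import prod

WILDCARD = "."  # assumed don't-care symbol; the source does not name one


def match_probability(pattern, probs):
    """Probability that the pattern matches at one fixed text position."""
    return prod(probs[c] for c in pattern if c != WILDCARD)


def overlap_probability(pattern, shift, probs):
    """Probability that the pattern matches at positions i and i + shift."""
    m = len(pattern)
    q = 1.0
    for k in range(m + shift):
        first = pattern[k] if k < m else WILDCARD
        second = pattern[k - shift] if k >= shift else WILDCARD
        if first == WILDCARD and second == WILDCARD:
            continue  # unconstrained text position
        if first != WILDCARD and second != WILDCARD and first != second:
            return 0.0  # the two copies demand different characters
        q *= probs[first if first != WILDCARD else second]
    return q


def occurrence_moments(pattern, n, probs):
    """Expectation and variance of the occurrence count in an iid text of length n."""
    m = len(pattern)
    if n < m:
        return 0.0, 0.0
    windows = n - m + 1
    p = match_probability(pattern, probs)
    expectation = windows * p
    # Sum of the variances of the per-position indicator variables.
    variance = windows * p * (1 - p)
    # Covariances of overlapping pairs; windows at distance >= m are independent.
    for shift in range(1, min(m - 1, n - m) + 1):
        pairs = windows - shift
        variance += 2 * pairs * (overlap_probability(pattern, shift, probs) - p * p)
    return expectation, variance
```

For example, under these assumptions `occurrence_moments("a.b", 4, {"a": 0.5, "b": 0.5})` gives an expectation of 0.5 and a variance of 0.375, since the two possible occurrences constrain disjoint text positions and are therefore independent.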