Faster variance computation for patterns with gaps

  • Authors:
  • Fabio Cunial

  • Affiliations:
  • College of Computing, Georgia Institute of Technology, Atlanta, GA

  • Venue:
  • MedAlg'12 Proceedings of the First Mediterranean conference on Design and Analysis of Algorithms
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Determining whether a pattern is statistically overrepresented or underrepresented in a string is a fundamental primitive in computational biology and in large-scale text mining. We study ways to speed up the computation of the expectation and variance of the number of occurrences of a pattern with rigid gaps in a random string. Our contributions are twofold: first, we focus on patterns in which groups of characters from an alphabet Σ can occur at each position. We describe a way to compute the exact expectation and variance of the number of occurrences of a pattern w in a random string generated by a Markov chain in O(|w|2) time, improving a previous result that required O(2|w|) time. We then consider the problem of computing expectation and variance of the motifs of a string s in an iid text. Motifs are rigid gapped patterns that occur at least twice in s, and in which at most one character from Σ occurs at each position. We study the case in which s is given offline, and an arbitrary motif w of s is queried online. We relate computational complexity to the structure of w and s, identifying sets of motifs that are amenable to o(|w|log|w|) time online computation after O(|s|3) preprocessing of s. Our algorithms lend themselves to efficient implementations.