Faster Algorithms for Sampling and Counting Biological Sequences

  • Authors:
  • Christina Boucher

  • Affiliations:
  • David R. Cheriton School of Computer Science, University of Waterloo

  • Venue:
  • SPIRE '09 Proceedings of the 16th International Symposium on String Processing and Information Retrieval
  • Year:
  • 2009

Abstract

A set of sequences S is pairwise bounded if the Hamming distance between any pair of sequences in S is at most 2d. The Consensus Sequence problem asks whether a pairwise bounded set has a consensus and, if it does, to find one such sequence s*. This problem is closely related to the motif-recognition problem, which abstractly models finding important subsequences in biological data. We give an efficient algorithm for sampling pairwise bounded sets, referred to as MarkovSampling, and show that it generates pairwise bounded sets uniformly at random. We illustrate the applicability of MarkovSampling to efficiently solving motif-recognition instances. Computing the expected number of motif sets has been a long-standing open problem in motif recognition [1,3]. We consider the related problem of counting the number of pairwise bounded sets, give new bounds on the number of pairwise bounded sets, and present an algorithmic approach to counting them.
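
To make the definitions concrete, the following is a minimal Python sketch of the pairwise-bounded test and a brute-force consensus search. It is not the MarkovSampling algorithm, whose details are not given in this abstract. The function names, the DNA alphabet, and the definition of a consensus as a sequence within Hamming distance d of every member of S are assumptions made for illustration; the exhaustive search is exponential in the sequence length and is only practical for tiny examples.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_pairwise_bounded(seqs, d):
    """True if every pair of sequences is within Hamming distance 2d."""
    return all(hamming(a, b) <= 2 * d for a, b in combinations(seqs, 2))


def find_consensus(seqs, d, alphabet="ACGT"):
    """Exhaustively search for a sequence within distance d of every input.

    Assumed definition of a consensus (not stated in the abstract).
    Returns the first such sequence in lexicographic alphabet order,
    or None if no candidate qualifies.
    """
    length = len(seqs[0])
    for candidate in product(alphabet, repeat=length):
        s = "".join(candidate)
        if all(hamming(s, t) <= d for t in seqs):
            return s
    return None


# A pairwise bounded set (d = 1) that has a consensus.
with_consensus = ["AAA", "AAT", "ATA"]

# A pairwise bounded set (d = 1) with no consensus: all pairs differ in
# exactly 2 positions, yet no sequence is within distance 1 of all four.
without_consensus = ["AAA", "ATT", "TAT", "TTA"]
```

Under these assumptions, both example sets pass `is_pairwise_bounded(..., 1)`, but `find_consensus` returns a sequence only for the first one. This illustrates why being pairwise bounded is necessary but not sufficient for a consensus to exist.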