Exact Distribution of a Spaced Seed Statistic for DNA Homology Detection

  • Authors:
  • Gary Benson; Denise Y. Mak

  • Affiliations:
  • Departments of Computer Science, Biology, Program in Bioinformatics, Boston University, Boston, MA 02215; Graduate Program in Bioinformatics, Boston University, Boston, MA 02215

  • Venue:
  • SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval
  • Year:
  • 2008

Abstract

Let a seed, S, be a string from the alphabet {1,*}, of arbitrary length k, which starts and ends with a 1. For example, S = 11*1. S occurs in a binary string T at position h if the length-k substring of T ending at position h contains a 1 in every position where there is a 1 in S. We say that the 1s at the corresponding positions in T are covered. We are interested in calculating the probability distribution for the number of 1s covered by a seed S in an iid Bernoulli string of length n with probability of 1 equal to p. We refer to this new probability distribution as C_{nSp}, for covered, with S being the seed. We present an efficient method to calculate this distribution exactly. Covered 1s represent matching positions detected in DNA sequences when using multiple hits of a spaced seed. Knowledge of the distribution provides a statistical threshold for distinguishing true homologies from randomly matching sequences.
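
To make the definition of covered 1s concrete, the following Python sketch counts them for a given seed and binary string, and then obtains the distribution for a small n by enumerating all 2^n strings. This brute-force enumeration is only an illustration of the definition; it is not the efficient method the abstract refers to, and it is practical only for small n. The function names covered_positions and covered_distribution are hypothetical, and positions are 0-indexed here for convenience.

```python
from itertools import product


def covered_positions(seed, text):
    """Return the set of 0-indexed positions in `text` covered by `seed`.

    `seed` is a string over {'1', '*'}; `text` is a sequence of '0'/'1'.
    A 1 in `text` is covered if it aligns with a 1 of the seed in some
    occurrence of the seed.
    """
    k = len(seed)
    ones = [i for i, c in enumerate(seed) if c == "1"]
    covered = set()
    for start in range(len(text) - k + 1):
        # The seed occurs here if every seed 1 aligns with a 1 in the text.
        if all(text[start + i] == "1" for i in ones):
            covered.update(start + i for i in ones)
    return covered


def covered_distribution(seed, n, p):
    """Brute-force distribution of the number of covered 1s.

    Enumerates all 2**n binary strings (assumption: n is small) and
    returns a list `dist` where dist[c] is the probability that exactly
    c ones are covered in an iid Bernoulli(p) string of length n.
    """
    dist = [0.0] * (n + 1)
    for bits in product("01", repeat=n):
        m = bits.count("1")
        prob = (p ** m) * ((1 - p) ** (n - m))
        dist[len(covered_positions(seed, bits))] += prob
    return dist


if __name__ == "__main__":
    # The seed 11*1 occurs twice in 1101101, covering positions 0, 1, 3, 4, 6.
    print(sorted(covered_positions("11*1", "1101101")))
    print(covered_distribution("11*1", 4, 0.5))
```

For S = 11*1 and n = 4 the seed can occur only once, so the count of covered 1s is either 0 or 3; a threshold for real data would be read from the upper tail of such a distribution at realistic n, which is where an efficient exact method is needed.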