Exact Distribution of a Spaced Seed Statistic for DNA Homology Detection

Authors:
Gary Benson; Denise Y. Mak
Affiliations:
Departments of Computer Science, Biology, Program in Bioinformatics, Boston University, Boston, MA 02215; Graduate Program in Bioinformatics, Boston University, Boston, MA 02215
Venue:
SPIRE '08 Proceedings of the 15th International Symposium on String Processing and Information Retrieval
Year:
2008

Abstract

Let a seed, S, be a string from the alphabet {1,*}, of arbitrary length k, which starts and ends with a 1. For example, S = 11*1. S occurs in a binary string T at position h if the length-k substring of T ending at position h contains a 1 in every position where there is a 1 in S. We say that the 1s at the corresponding positions in T are covered. We are interested in calculating the probability distribution for the number of 1s covered by a seed S in an i.i.d. Bernoulli string of length n with probability of 1 equal to p. We refer to this new probability distribution as C nSp, for covered, with S being the seed. We present an efficient method to calculate this distribution exactly. Covered 1s represent matching positions detected in DNA sequences when using multiple hits of a spaced seed. Knowledge of the distribution provides a statistical threshold for distinguishing true homologies from randomly matching sequences.
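The definition of covered 1s can be illustrated with a small brute-force sketch. This is not the efficient method the abstract refers to: it simply enumerates all 2^n binary strings, so it is only practical for small n. The function names `covered_count` and `covered_distribution` are hypothetical, positions are 0-based, and seed occurrences are indexed by start position rather than end position, which does not change the set of covered 1s.

```python
from fractions import Fraction
from itertools import product


def covered_count(seed, text):
    """Count the 1s in `text` covered by at least one occurrence of `seed`."""
    k = len(seed)
    # Offsets within the seed that require a 1 in the text.
    ones = [i for i, c in enumerate(seed) if c == "1"]
    covered = set()
    for start in range(len(text) - k + 1):
        # The seed occurs here if every required offset holds a 1.
        if all(text[start + i] == "1" for i in ones):
            covered.update(start + i for i in ones)
    return len(covered)


def covered_distribution(seed, n, p):
    """Exact distribution of the covered-1 count by full enumeration.

    Returns a list `dist` where dist[c] is the probability that exactly
    c ones are covered in an i.i.d. Bernoulli(p) string of length n.
    Exponential in n; intended only as a check on small cases.
    """
    p = Fraction(p)
    dist = [Fraction(0)] * (n + 1)
    for bits in product("01", repeat=n):
        text = "".join(bits)
        m = text.count("1")
        dist[covered_count(seed, text)] += p**m * (1 - p) ** (n - m)
    return dist


if __name__ == "__main__":
    print(covered_count("11*1", "110111"))
    print(covered_distribution("11*1", 4, Fraction(1, 2)))
```

Note that in the string 1111 with seed 11*1, only three 1s are covered: the 1 aligned with the `*` is not counted unless another occurrence of the seed covers it.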