Let a seed, S, be a string over the alphabet {1,*}, of arbitrary length k, which starts and ends with a 1. For example, S = 11*1. S occurs in a binary string T at position h if the length-k substring of T ending at position h contains a 1 in every position where there is a 1 in S. We say that the 1s at the corresponding positions in T are covered. We are interested in calculating the probability distribution for the number of 1s covered by a seed S in an i.i.d. Bernoulli string of length n with probability of 1 equal to p. We refer to this new probability distribution as C_{nSp}, for covered, with S being the seed. We present an efficient method to calculate this distribution exactly. Covered 1s represent matching positions detected in DNA sequences when using multiple hits of a spaced seed. Knowledge of the distribution provides a statistical threshold for distinguishing true homologies from randomly matching sequences.
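To make the definition of covered 1s concrete, the following sketch computes the distribution by brute-force enumeration of all 2^n binary strings. This is not the efficient method referred to above; it is an exponential-time illustration that is only practical for small n. The function names `covered_count` and `covered_distribution` are hypothetical, and the sketch counts a 1 as covered once even when several seed occurrences overlap it, which is an assumption about how multiple hits are tallied.

```python
from itertools import product


def covered_count(seed: str, text: str) -> int:
    """Count the 1s in `text` covered by at least one occurrence of `seed`.

    `seed` is a string over {'1', '*'}; `text` is a string over {'0', '1'}.
    """
    k = len(seed)
    # Offsets within the seed that require a 1 in the text.
    ones = [i for i, c in enumerate(seed) if c == "1"]
    covered = set()
    for start in range(len(text) - k + 1):
        # The seed occurs in the window text[start:start + k] if every
        # required offset holds a 1.
        if all(text[start + i] == "1" for i in ones):
            covered.update(start + i for i in ones)
    return len(covered)


def covered_distribution(seed: str, n: int, p: float) -> list[float]:
    """Return dist, where dist[c] is the probability that exactly c ones are
    covered in an i.i.d. Bernoulli(p) string of length n.

    Brute force over all 2**n strings (assumption: n is small).
    """
    dist = [0.0] * (n + 1)
    for bits in product("01", repeat=n):
        text = "".join(bits)
        m = text.count("1")
        # Probability of this particular string under the Bernoulli model.
        dist[covered_count(seed, text)] += p**m * (1 - p) ** (n - m)
    return dist
```

For example, with S = 11*1 the call `covered_count("11*1", "11111")` returns 5, because the two overlapping occurrences together cover every position, whereas `covered_count("11*1", "1111")` returns 3, since the position under the `*` is not covered by the single occurrence.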