Optimal spaced seeds for hidden Markov models, with application to homologous coding regions

Authors:
Broňa Brejová;Daniel G. Brown;Tomáš Vinař
Affiliations:
School of Computer Science, University of Waterloo, Waterloo, ON, Canada;School of Computer Science, University of Waterloo, Waterloo, ON, Canada;School of Computer Science, University of Waterloo, Waterloo, ON, Canada
Venue:
CPM'03 Proceedings of the 14th annual conference on Combinatorial pattern matching
Year:
2003

Abstract

We study the problem of computing optimal spaced seeds for detecting sequences generated by a hidden Markov model. Inspired by recent work in DNA sequence alignment, we have developed such a model for representing the conservation between related DNA coding sequences. Our model includes positional dependencies and periodic rates of conservation, as well as regional deviations in the overall conservation rate. We show that, for hidden Markov models in general, the probability that a seed is matched in a region can be computed efficiently, and we use these methods to compute the optimal seed for our models. Our experiments on real data show that the optimal seeds are substantially more sensitive than the seeds used in the standard alignment program BLAST, and also substantially more sensitive than those of PatternHunter or WABA, both of which use spaced seeds. Our results offer the hope of improved gene finding, due to fewer missed exons in DNA/DNA comparison, and of more effective homology search in general; they may also have applications outside of bioinformatics.
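
To make the notion of "the probability that a seed is matched in a region" concrete, the following is a minimal sketch, not the algorithm of the paper. It assumes the alignment is summarized as a sequence of bits (1 for a matching position, 0 for a mismatch) emitted by a hidden Markov model, and that a seed is a string over "1" (position must match) and "0" (don't care). The function name `seed_hit_probability` and its parameters `init`, `trans`, and `match_prob` are hypothetical names chosen for this illustration. The sketch tracks the last `len(seed) - 1` bits explicitly, so its cost grows exponentially with the seed span; it is meant only to show what quantity is being computed.

```python
from collections import defaultdict


def seed_hit_probability(seed, length, init, trans, match_prob):
    """Probability that `seed` hits at least once in a region of `length` positions.

    Assumptions of this sketch (not taken from the paper):
      seed       -- string of '1' (must match) and '0' (don't care)
      init[q]    -- probability that the HMM starts in state q
      trans[q][r]-- probability of moving from state q to state r
      match_prob[q] -- probability that state q emits a match (bit 1)
    """
    span = len(seed)
    required = [i for i, c in enumerate(seed) if c == "1"]
    n = len(init)
    if length <= 0:
        return 0.0

    def emit(dist, state, history, p):
        # Emit one bit from `state`; probability mass that completes a hit is dropped.
        for bit, pb in (("1", match_prob[state]), ("0", 1.0 - match_prob[state])):
            if p * pb == 0.0:
                continue
            window = history + bit
            if len(window) == span and all(window[i] == "1" for i in required):
                continue  # seed matched here: this mass is no longer "alive"
            kept = window[-(span - 1):] if span > 1 else ""
            dist[(state, kept)] += p * pb

    # alive[(state, history)] = probability of being here with no hit so far
    alive = defaultdict(float)
    for q in range(n):
        emit(alive, q, "", init[q])
    for _ in range(length - 1):
        nxt = defaultdict(float)
        for (q, history), p in alive.items():
            for r in range(n):
                if trans[q][r] > 0.0:
                    emit(nxt, r, history, p * trans[q][r])
        alive = nxt
    return 1.0 - sum(alive.values())


# A toy three-state cycle imitating codon periodicity (illustrative values only):
# the first two codon positions always match, the third never does.
cycle_init = [1.0, 0.0, 0.0]
cycle_trans = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
cycle_match = [1.0, 1.0, 0.0]

contiguous = seed_hit_probability("111", 6, cycle_init, cycle_trans, cycle_match)
spaced = seed_hit_probability("11011", 6, cycle_init, cycle_trans, cycle_match)
```

In this deliberately extreme toy model the contiguous seed `111` can never hit, while the spaced seed `11011` skips the unconserved third codon position and always hits, even though it requires more matching positions. Searching over candidate seeds for the one that maximizes this probability is the optimization problem the abstract refers to.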