On Subset Seeds for Protein Alignment

Authors:
Mikhail Roytberg;Anna Gambin;Laurent Noe;Slawomir Lasota;Eugenia Furletova;Ewa Szczurek;Gregory Kucherov
Affiliations:
Institute of Mathematical Problems in Biology, Pushchino, Moscow;Warsaw University, Poland;LIFL/CNRS/INRIA, France;Warsaw University, Poland;Institute of Mathematical Problems in Biology, Pushchino, Moscow;Max Planck Institute for Molecular Genetics, Berlin;LIFL/CNRS/INRIA, France
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2009

References (11):
Designing seeds for similarity search in genomic DNA. RECOMB '03 Proceedings of the seventh annual international conference on Research in computational molecular biology
Constrained Independence System and Triangulations of Planar Point Sets. COCOON '95 Proceedings of the First Annual International Conference on Computing and Combinatorics
Designing multiple simultaneous seeds for DNA similarity search. RECOMB '04 Proceedings of the eighth annual international conference on Research in computational molecular biology
On spaced seeds for similarity search. Discrete Applied Mathematics
Efficient Methods for Generating Optimal Single and Multiple Spaced Seeds. BIBE '04 Proceedings of the 4th IEEE Symposium on Bioinformatics and Bioengineering
Optimizing Multiple Seeds for Protein Homology Search. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Vector seeds: An extension to spaced seeds. Journal of Computer and System Sciences - Special issue on bioinformatics II
tPatternHunter: gapped, fast and sensitive translated homology search. Bioinformatics
Indel seeds for homology search. Bioinformatics
Rapid Homology Search with Neighbor Seeds. Algorithmica
Improved BLAST searches using longer words for protein seeding. Bioinformatics

Abstract

We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is how to design efficient seed alphabets from which seeds with optimal sensitivity/selectivity trade-offs can be constructed. We propose several design methods and use them to construct several alphabets. We then perform a comparative analysis of seeds built over those alphabets and compare them with the standard Blastp seeding method [2], [3], as well as with the family of vector seeds proposed in [4]. While the formalism of subset seeds is less expressive (but less costly to implement) than the cumulative principle used in Blastp and vector seeds, our seeds show similar or even better performance than Blastp on Bernoulli models of proteins compatible with the common BLOSUM62 matrix. Finally, we perform large-scale benchmarking of our seeds against several major databases of protein alignments. Here again, the results show comparable or better performance of our seeds versus Blastp.
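
To make the notion of a subset seed hit concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each seed letter is a partition of the amino-acid alphabet and that two aligned residues match a letter when they fall in the same class. The symbols `#`, `@`, and `_`, the residue groupings, and the function names are all hypothetical and chosen only for illustration; they are not the alphabets designed in the paper. The alignment is assumed to be ungapped.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def seed_letter(groups):
    """Map each residue to a class id; residues not listed get a class of their own."""
    classes = {}
    for gid, group in enumerate(groups):
        for aa in group:
            classes[aa] = gid
    next_id = len(groups)
    for aa in AMINO_ACIDS:
        if aa not in classes:
            classes[aa] = next_id
            next_id += 1
    return classes


# Hypothetical seed alphabet (illustrative groupings, not taken from the paper).
LETTERS = {
    "#": seed_letter([]),                                   # exact match only
    "@": seed_letter(["ILMV", "FWY", "DE", "KR", "ST"]),    # coarse grouping
    "_": seed_letter([AMINO_ACIDS]),                        # joker: any pair matches
}


def seed_hits(seed, a, b):
    """Return the offsets where every aligned residue pair matches its seed letter."""
    span = len(seed)
    hits = []
    for i in range(min(len(a), len(b)) - span + 1):
        if all(
            LETTERS[s][a[i + k]] == LETTERS[s][b[i + k]]
            for k, s in enumerate(seed)
        ):
            hits.append(i)
    return hits


if __name__ == "__main__":
    print(seed_hits("#@_#", "MKVLAW", "MRILGW"))  # [0]
    print(seed_hits("####", "MKVLAW", "MRILGW"))  # []
```

In this sketch, each position is tested independently against its seed letter, so a hit is decided by a per-position lookup. This contrasts with the cumulative principle mentioned in the abstract, where scores are accumulated over the whole word before a threshold is applied.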