Provably sensitive indexing strategies for biosequence similarity search

  • Authors: Jeremy Buhler
  • Affiliations: Washington University, St. Louis, MO
  • Venue: Proceedings of the sixth annual international conference on Computational biology
  • Year: 2002

Abstract

The field of algorithms for pairwise biosequence similarity search is dominated by heuristic methods of high efficiency but uncertain sensitivity. One reason that more formal string matching algorithms with sensitivity guarantees have not been applied to biosequences is that they cannot directly find similarities that score highly under substitution score functions such as the DNAPAM-TT [20], PAM [9], or BLOSUM [12] families of matrices. We describe a general technique, score simulation, to map ungapped similarity search problems using these score functions into the problem of finding pairs of strings that are close in Hamming space. Score simulation leads to indexing schemes for biosequences that permit efficient ungapped similarity searches with formal guarantees of sensitivity using arbitrary score functions. In particular, we introduce the lsh-all-pairs-sim algorithm for finding local similarities in large biosequence collections and show that it is both computationally feasible and sensitive in practice.
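
The abstract reduces similarity search to finding pairs of strings that are close in Hamming space. As a rough illustration of that target problem only, the sketch below uses the generic position-sampling form of locality-sensitive hashing. It is not the paper's lsh-all-pairs-sim algorithm and does not implement score simulation. The function names and the parameters `k`, `num_tables`, and `max_dist` are hypothetical choices made for this example.

```python
import random
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def lsh_close_pairs(strings, k, num_tables, max_dist, seed=0):
    """Return index pairs (i, j), i < j, whose strings are within max_dist
    in Hamming distance and collide in at least one hash table.

    Generic position-sampling LSH; all parameter names are illustrative.
    """
    length = len(strings[0])
    if any(len(s) != length for s in strings):
        raise ValueError("all strings must have the same length")
    rng = random.Random(seed)
    found = set()
    for _ in range(num_tables):
        # Each table hashes a string by its characters at k random positions.
        positions = rng.sample(range(length), k)
        buckets = defaultdict(list)
        for idx, s in enumerate(strings):
            buckets[tuple(s[p] for p in positions)].append(idx)
        # Only strings sharing a bucket are compared in full.
        for members in buckets.values():
            for i, j in combinations(members, 2):
                if hamming(strings[i], strings[j]) <= max_dist:
                    found.add((i, j))
    return found


strings = ["ACGTACGTAC", "ACGTACGTAT", "ACGTACGTAC", "TGCATGCATG"]
pairs = lsh_close_pairs(strings, k=4, num_tables=8, max_dist=1)
```

In this generic scheme, two strings of length L at Hamming distance d collide in a single table with probability C(L-d, k) / C(L, k), so the chance of missing them across T independent tables is (1 - C(L-d, k) / C(L, k))^T. Being able to bound the miss probability this way is what makes a sensitivity guarantee possible for hashing of this kind. The paper's own analysis may differ in its details.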