Three Improvements to the BLASTP Search of Genome Databases

  • Authors:
  • Shawn Delaney; Greg Butler; Clement Lam; Larry Thiel

  • Venue:
  • SSDBM '00 Proceedings of the 12th International Conference on Scientific and Statistical Database Management
  • Year:
  • 2000

Abstract

The BLASTP program is a search tool for databases of protein sequences that is widely used by biologists as a first step in investigating new genome sequences. BLASTP finds high-scoring local alignments without gaps between a query sequence q and sequences s in the database. The score of an alignment is the sum of the scores of the individual aligned pairs of amino acids. These individual scores come from a scoring matrix that models the rate of evolutionary mutation.

Here we provide a detailed description of the original program and three separate optimizations to it. BLASTP consists of three steps, which we call neighborhood construction, hit detection, and hit extension. All three optimizations target hit extension, since it accounts for 93% of the execution time. The first optimization alters the data representation of the query sequence and the related code for indexing the scoring matrix. The second optimization performs extensions in step sizes of two rather than one. The third optimization forestalls the call to the hit extension step in cases that are unlikely to lead to a high-scoring alignment. Individually, the three optimizations yield speedups of 15%, 48%, and 63%, respectively.
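The scoring rule and the hit extension step described above can be illustrated with a minimal Python sketch. It is not the authors' code and does not reproduce any of the three optimizations. The function names, the toy match/mismatch scores (standing in for a real scoring matrix), and the drop-off threshold `x_drop` used to stop an extension are assumptions made for this example.

```python
def toy_score(a: str, b: str) -> int:
    """Hypothetical stand-in for a scoring matrix lookup (not BLOSUM or PAM)."""
    return 5 if a == b else -4


def ungapped_score(q: str, s: str, score=toy_score) -> int:
    """Score of a gap-free alignment: the sum of per-position pair scores."""
    if len(q) != len(s):
        raise ValueError("ungapped segments must have equal length")
    return sum(score(a, b) for a, b in zip(q, s))


def extend_right(q: str, s: str, qi: int, si: int, x_drop: int, score=toy_score):
    """Extend a hit rightwards one residue at a time, starting at q[qi], s[si].

    Returns (best_score, best_length). The extension stops at the end of
    either sequence, or once the running score has fallen more than x_drop
    below the best score seen so far (an assumed stopping rule).
    """
    best = running = 0
    best_len = length = 0
    while qi + length < len(q) and si + length < len(s):
        running += score(q[qi + length], s[si + length])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break
    return best, best_len


if __name__ == "__main__":
    print(ungapped_score("ACDE", "ACDF"))
    print(extend_right("ACDEFG", "ACDEWW", 0, 0, x_drop=5))
```

Each step of the loop costs one scoring lookup and one comparison. Since hit extension is reported to take 93% of the execution time, this is the kind of loop where the cost of each lookup and the number of steps would matter.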