Fast search in DNA sequence databases using punctuation and indexing

  • Authors:
  • Yi Lu;Shiyong Lu;Jeffrey L. Ram

  • Affiliations:
  • Department of Computer Science, Wayne State University, Detroit, MI;Department of Computer Science, Wayne State University, Detroit, MI;Department of Physiology, Wayne State University, Detroit, MI

  • Venue:
  • ACST'06 Proceedings of the 2nd IASTED international conference on Advances in computer science and technology
  • Year:
  • 2006

Abstract

Exact pattern searching in DNA sequence databases has applications in the identification of highly conserved regulatory sequences, the design of hybridization probes, and the improvement of approximate homology searching tools such as BLAST and BLAT. We propose a new pattern searching algorithm, Compressed-Punctuated-Boyer-Moore (cp-BM), to speed up exact pattern matching in DNA sequences. cp-BM represents each A, T, C, or G character with two bits, packing four characters into eight bits (4C8B compression), and adds punctuator characters that unambiguously mark the encoding frame of the compressed target sequence, thereby solving the misalignment problem that arises when searching for patterns under ordinary 4C8B compression. cp-BM searches DNA patterns at least 6 times faster than AGREP for pattern lengths ≥ 128 and between 2-fold and 5-fold faster than d-BM for all pattern lengths. cp-BM's performance is further enhanced by punctuator indexing and multiple punctuators, especially for short sequences, yielding greater than 10-fold speedups compared to d-BM and AGREP. In addition, cp-BM outperformed BLAT for sequences 64 or more bases in length, and was more than three-fold faster for 256-base sequences.
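
The misalignment problem mentioned in the abstract can be illustrated with a minimal sketch. This is not the cp-BM algorithm: the abstract does not describe the punctuator scheme in detail, so none is attempted here. The 2-bit code assignment and the function names `pack_4c8b`, `naive_packed_find`, and `frame_aware_find` are assumptions made for illustration. The sketch only shows why a byte-level search over 4C8B-packed data misses a match that does not start on a 4-base boundary, and how a brute-force search over all four frames (a simple stand-in, not the paper's method) recovers it.

```python
# Hypothetical 2-bit code assignment; the abstract does not specify the mapping.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_4c8b(seq: str) -> bytes:
    """Pack four bases into each byte (2 bits per base).

    Trailing bases that do not fill a whole byte are dropped for simplicity.
    """
    out = bytearray()
    usable = len(seq) - len(seq) % 4
    for i in range(0, usable, 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)


def naive_packed_find(target: str, pattern: str) -> bool:
    """Byte-level search that assumes the pattern starts on a 4-base boundary."""
    return pack_4c8b(pattern) in pack_4c8b(target)


def frame_aware_find(target: str, pattern: str) -> int:
    """Brute-force workaround: search each of the four possible frames.

    Returns the base offset of a match, or -1 if there is none.
    Candidates found in packed form are verified against the raw strings,
    because packing drops any trailing partial byte of the pattern.
    """
    if len(pattern) < 4:
        raise ValueError("sketch requires a pattern of at least 4 bases")
    packed_pattern = pack_4c8b(pattern)
    for shift in range(4):
        packed_target = pack_4c8b(target[shift:])
        idx = packed_target.find(packed_pattern)
        while idx != -1:
            pos = shift + 4 * idx
            if target[pos:pos + len(pattern)] == pattern:
                return pos
            idx = packed_target.find(packed_pattern, idx + 1)
    return -1


pattern = "ACGTACGT"
aligned = "TTTT" + pattern + "GGGG"      # pattern starts at base 4 (in frame)
shifted = "TT" + pattern + "GGGGGG"      # pattern starts at base 2 (out of frame)

print(naive_packed_find(aligned, pattern))   # in-frame occurrence is found
print(naive_packed_find(shifted, pattern))   # out-of-frame occurrence is missed
print(frame_aware_find(shifted, pattern))    # trying all four frames recovers it
```

Searching all four frames repeats the packing and the scan up to four times. According to the abstract, cp-BM instead uses punctuator characters to make the encoding frame of the compressed target unambiguous.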