Comparing Compressed Sequences for Faster Nucleotide BLAST Searches

Authors:
Michael Cameron;Hugh Williams
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2007

Citing 9

A Four Russians algorithm for regular expression pattern matching
Journal of the ACM (JACM)

A subquadratic algorithm for approximate regular expression matching
Journal of Algorithms

Managing gigabytes (2nd ed.): compressing and indexing documents and images

A sub-quadratic sequence alignment algorithm for unrestricted cost matrices
SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms

Indexing and Retrieval for Genomic Databases
IEEE Transactions on Knowledge and Data Engineering

Indexing Nucleotide Databases for Fast Query Evaluation
EDBT '96 Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology

FLASH: A Fast Look-Up Algorithm for String Homology
Proceedings of the 1st International Conference on Intelligent Systems for Molecular Biology

Compact Encoding Strategies for DNA Sequence Similarity Search
Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology

Improved Gapped Alignment in BLAST
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

Cited 2

BioExtract Server—An Integrated Workflow-Enabling System to Access and Analyze Heterogeneous, Distributed Biomolecular Data
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

A New Efficient Data Structure for Storage and Retrieval of Multiple Biosequences
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

Abstract

Molecular biologists, geneticists, and other life scientists use the BLAST homology search package as their first step for discovery of information about unknown or poorly annotated genomic sequences. There are two main variants of BLAST: BLASTP for searching protein collections and BLASTN for nucleotide collections. Surprisingly, BLASTN has had very little attention; for example, the algorithms it uses do not follow those described in the 1997 BLAST paper [1] and no exact description has been published. It is important that BLASTN is state-of-the-art: Nucleotide collections such as GenBank dwarf the protein collections in size, they double in size almost yearly, and they take many minutes to search on modern general purpose workstations. This paper proposes significant improvements to the BLASTN algorithms. Each of our schemes is based on compressed bytepacked formats that allow queries and collection sequences to be compared four bases at a time, permitting very fast query evaluation using lookup tables and numeric comparisons. Our most significant innovations are two new, fast gapped alignment schemes that allow accurate sequence alignment without decompression of the collection sequences. Overall, our innovations more than double the speed of BLASTN with no effect on accuracy and have been integrated into our new version of BLAST that is freely available for download from http://www.fsa-blast.org/.
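
The idea of comparing bytepacked sequences four bases at a time can be illustrated with a minimal sketch. It is not the authors' implementation: the 2-bit codes, the padding rule, and the names `pack`, `MATCH_TABLE`, and `count_matches` are all assumptions made for illustration. It packs four nucleotides into each byte and uses a 256-entry lookup table, indexed by the XOR of two packed bytes, to count matching bases without unpacking.

```python
# Hypothetical 2-bit codes; the encoding used by BLASTN or the paper may differ.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq):
    """Pack a nucleotide string into bytes, four bases per byte.

    The first base of each group occupies the two highest bits.
    A short final group is padded with 'A' (an assumption of this sketch).
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].ljust(4, "A")
        b = 0
        for base in chunk:
            b = (b << 2) | CODE[base]
        out.append(b)
    return bytes(out)


def _zero_fields(x):
    """Count the 2-bit fields of x that are zero (positions where two bytes agree)."""
    return sum(1 for shift in (0, 2, 4, 6) if (x >> shift) & 3 == 0)


# Lookup table: MATCH_TABLE[a ^ b] is the number of matching bases (0..4)
# between packed bytes a and b.
MATCH_TABLE = [_zero_fields(x) for x in range(256)]


def count_matches(packed_a, packed_b):
    """Count matching bases between two packed sequences, one byte (four bases) per step."""
    return sum(MATCH_TABLE[a ^ b] for a, b in zip(packed_a, packed_b))


if __name__ == "__main__":
    query = pack("ACGTACGT")
    subject = pack("ACGTACGA")
    print(count_matches(query, subject))
```

In this sketch an exact four-base match reduces to a single numeric comparison of two bytes (`a == b`), and a mismatch count comes from one table lookup. Padding bases in the final byte are counted like any others here, so a real implementation would need to track the true sequence length.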