Clustering near-identical sequences for fast homology search

Authors:
Michael Cameron;Yaniv Bernstein;Hugh E. Williams
Affiliations:
School of Computer Science and Information Technology, RMIT University, Melbourne, Australia;School of Computer Science and Information Technology, RMIT University, Melbourne, Australia;Microsoft Corporation, Redmond, Washington
Venue:
RECOMB'06 Proceedings of the 10th annual international conference on Research in Computational Molecular Biology
Year:
2006

Abstract

We present a new approach to managing redundancy in sequence databanks such as GenBank. We store clusters of near-identical sequences as a representative union-sequence and a set of corresponding edits to that sequence. During search, the query is compared to only the union-sequences representing each cluster; cluster members are then only reconstructed and aligned if the union-sequence achieves a sufficiently high score. Using this approach in BLAST results in a 27% reduction in collection size and a corresponding 22% decrease in search time with no significant change in accuracy. We also describe our method for clustering that uses fingerprinting, an approach that has been successfully applied to collections of text and web documents in Information Retrieval. Our clustering approach is ten times faster on the GenBank nonredundant protein database than the fastest existing approach, CD-HIT. We have integrated our approach into FSA-BLAST, our new Open Source version of BLAST, available from http://www.fsa-blast.org/. As a result, FSA-BLAST is twice as fast as NCBI-BLAST with no significant change in accuracy.
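
The cluster representation and two-stage search described in the abstract can be illustrated with a minimal sketch. This is not the FSA-BLAST implementation: the `Cluster` class, the substitution-only edit format, the equal-length restriction, and the toy `identity_score` function are all assumptions made for illustration, and the sketch does not cover how a union-sequence must be built and scored so that filtering on it does not lose cluster members.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Cluster:
    """Hypothetical cluster: one stored sequence plus per-member edits."""
    union: str  # stands in for the union-sequence of the cluster
    # member name -> list of (position, residue) substitutions against `union`
    edits: Dict[str, List[Tuple[int, str]]] = field(default_factory=dict)

    def add_member(self, name: str, seq: str) -> None:
        # Assumption: members have the same length as the union-sequence,
        # so only substitutions need to be recorded.
        if len(seq) != len(self.union):
            raise ValueError("this sketch only handles equal-length members")
        self.edits[name] = [
            (i, c) for i, (u, c) in enumerate(zip(self.union, seq)) if u != c
        ]

    def reconstruct(self, name: str) -> str:
        # Apply the stored substitutions to recover the original member.
        chars = list(self.union)
        for pos, residue in self.edits[name]:
            chars[pos] = residue
        return "".join(chars)


def identity_score(query: str, target: str) -> int:
    # Toy stand-in for an alignment score: count of matching positions.
    return sum(q == t for q, t in zip(query, target))


def search(query: str, clusters: List[Cluster], threshold: int):
    """Score each union-sequence first; expand members only on a good score."""
    hits = []
    for cluster in clusters:
        if identity_score(query, cluster.union) < threshold:
            continue  # skip the whole cluster without reconstructing members
        for name in cluster.edits:
            member = cluster.reconstruct(name)
            hits.append((name, identity_score(query, member)))
    return hits


near = Cluster("MKTAYIAKQR")
near.add_member("seq_a", "MKTAYIAKQR")
near.add_member("seq_b", "MKTAYLAKQR")
far = Cluster("GGGGGGGGGG")
far.add_member("seq_c", "GGGGGGGGGG")

results = search("MKTAYLAKQR", [near, far], threshold=8)
```

In this toy run only the first cluster passes the threshold, so `seq_c` is never reconstructed. Each member costs only its list of differences in storage, which is the intuition behind the reduction in collection size reported in the abstract.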