Accelerating Approximate Subsequence Search on Large Protein Sequence Databases

Authors:
Jiong Yang; Wei Wang; Yi Xia; Philip S. Yu
Venue:
CSB '02 Proceedings of the IEEE Computer Society Conference on Bioinformatics
Year:
2002

Citing (15):
Introduction to algorithms.
Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science - Selected papers of the Combinatorial Pattern Matching School.
Fundamentals of database systems (2nd ed.).
Randomized algorithms.
Efficient implementation of suffix trees. Software: Practice & Experience.
q-gram based database searching using a suffix array (QUASAR). RECOMB '99 Proceedings of the third annual international conference on Computational molecular biology.
Sequence homology detection through large scale pattern discovery. RECOMB '99 Proceedings of the third annual international conference on Computational molecular biology.
The string B-tree: a new data structure for string search in external memory and its applications. Journal of the ACM (JACM).
A Space-Economical Suffix Tree Construction Algorithm. Journal of the ACM (JACM).
Finding motifs using random projections. RECOMB '01 Proceedings of the fifth annual international conference on Computational biology.
Fast and simple character classes and bounded gaps pattern matching, with application to protein searching. RECOMB '01 Proceedings of the fifth annual international conference on Computational biology.
A Database Index to Large Biological Sequences. Proceedings of the 27th International Conference on Very Large Data Bases.
Indexing Text with Approximate q-Grams. CPM '00 Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching.
Proximity Matching Using Fixed-Queries Trees. CPM '94 Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching.
Overcoming the Memory Bottleneck in Suffix Tree Construction. FOCS '98 Proceedings of the 39th Annual Symposium on Foundations of Computer Science.

Cited by (1):
Suffix trees for inputs larger than main memory. Information Systems.


Abstract

Bioinformatics has become an active research area in recent years. The number of mapped sequences doubles every fourteen months. BLAST has been widely employed for retrieving sequences that have portion(s) similar to a given sequence. However, BLAST has to scan the entire database every time a query is issued, which can be very time consuming, especially when the database is large. In this paper, we study the problem of how to build a persistent index structure for protein sequences to support approximate match. The suffix tree has been proposed as a solution for indexing sequence databases and has been used to organize DNA sequences (Hunt et al. 2001). Unfortunately, it suffers from a "memory bottleneck" that prevents it from being applied efficiently to a large database. Performance degrades even further for protein databases because of the larger fanout at each node. Here, we employ an indexing structure, called the BASS-tree, to support approximate match in sublinear time on a large protein database. We call this indexing method the sequence approximate match (SAM) index method. The search for approximate matches can be quickly directed to the portion of the database with a high potential of matching. Our experiments demonstrate a potential performance improvement of an order of magnitude over alternative methods such as the BLAST algorithm and the suffix tree.
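
The abstract contrasts scanning the whole database on every query with consulting a persistent index that points the search at promising regions. The sketch below illustrates that general idea only. It is not the BASS-tree or the SAM index, whose structure the abstract does not describe; it substitutes a simple q-gram inverted index with Hamming-distance verification. All names (`build_qgram_index`, `search`, the parameter `q`) and the sample sequences are assumptions made for illustration. The filter is lossless only when the query length is at least `(max_mismatches + 1) * q`, because at least one q-gram of the query must then survive unchanged.

```python
from collections import defaultdict


def build_qgram_index(sequences, q=3):
    """Build once: map each q-gram to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for pos in range(len(seq) - q + 1):
            index[seq[pos:pos + q]].append((seq_id, pos))
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def search(index, sequences, query, max_mismatches, q=3):
    """Return sorted (seq_id, start, mismatches) for windows close to the query.

    Only windows that share at least one exact q-gram with the query, at the
    matching offset, are verified; the rest of the database is never touched.
    """
    m = len(query)
    candidates = set()
    for qpos in range(m - q + 1):
        for seq_id, pos in index.get(query[qpos:qpos + q], ()):
            start = pos - qpos  # window start implied by this q-gram hit
            if 0 <= start <= len(sequences[seq_id]) - m:
                candidates.add((seq_id, start))
    hits = []
    for seq_id, start in sorted(candidates):
        d = hamming(query, sequences[seq_id][start:start + m])
        if d <= max_mismatches:
            hits.append((seq_id, start, d))
    return hits
```

A query such as `search(index, sequences, "MKTAYIAKQ", max_mismatches=1)` then verifies only the few windows the index proposes, whereas a scan would compare the query against every window of every sequence. This sketch handles substitutions only; supporting insertions and deletions would require an edit-distance verification step instead of `hamming`.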