OASIS: an online and accurate technique for local-alignment searches on biological sequences

Authors:
Colin Meek; Jignesh M. Patel; Shruti Kasetty
Affiliations:
University of Michigan, Ann Arbor, MI; University of Michigan, Ann Arbor, MI; University of Michigan, Ann Arbor, MI
Venue:
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Year:
2003

Abstract

A common query against large protein and gene sequence data sets is to locate targets that are similar to an input query sequence. Popular search tools, such as BLAST, employ heuristics to improve the speed of such searches. However, these heuristics can miss targets, which in many cases is undesirable. The alternative is to use an accurate algorithm, such as the Smith-Waterman (S-W) algorithm, but accurate algorithms are computationally very expensive, which limits their use in practice. This paper takes on the challenge of designing an accurate and efficient algorithm for evaluating local-alignment searches. To meet this goal, we propose a novel search algorithm, called OASIS, which employs a dynamic-programming A*-search driven by a suffix-tree index built on the input data set. We experimentally evaluate OASIS and demonstrate that for an important class of searches, in which the query sequences are short, OASIS is more than an order of magnitude faster than S-W. In addition, the speed of OASIS is comparable to that of BLAST. Furthermore, OASIS returns results in decreasing order of matching score, making it possible to use OASIS in an online setting. Consequently, we believe that it may now be practically feasible to query large biological sequence data sets using an accurate local-alignment search algorithm.
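
For readers unfamiliar with the accurate baseline named in the abstract, the following is a minimal sketch of Smith-Waterman local-alignment scoring. It is not OASIS and is not taken from the paper. The function name `smith_waterman_score` and the scoring parameters (match +2, mismatch -1, linear gap -1) are illustrative assumptions. Practical protein searches typically use a substitution matrix and affine gap penalties instead.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score between two strings.

    Assumed scoring scheme (illustrative only): a fixed match/mismatch
    score and a linear gap penalty.
    """
    # prev holds the previous DP row; row 0 is all zeros.
    prev = [0] * (len(target) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + sub,  # align query[i-1] with target[j-1]
                prev[j] + gap,      # gap in the target
                curr[j - 1] + gap,  # gap in the query
            )
            best = max(best, curr[j])
        prev = curr
    return best


def rank_targets(query, targets):
    """Score every target exhaustively, then sort by decreasing score."""
    scored = [(smith_waterman_score(query, t), t) for t in targets]
    return sorted(scored, key=lambda pair: -pair[0])
```

Each comparison costs time proportional to the product of the two sequence lengths, and `rank_targets` must score the whole data set before it can report anything. That cost is the one the abstract describes as limiting S-W in practice, and the need to finish before reporting is what an online, score-ordered search avoids.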