A common query against large protein and gene sequence data sets is to locate targets that are similar to an input query sequence. Popular search tools, such as BLAST, employ heuristics to speed up such searches. However, these heuristics can sometimes miss targets, which in many cases is undesirable. The alternative to BLAST is to use an accurate algorithm, such as the Smith-Waterman (S-W) algorithm. However, accurate algorithms are computationally very expensive, which limits their use in practice. This paper takes on the challenge of designing an accurate and efficient algorithm for evaluating local-alignment searches. To meet this goal, we propose a novel search algorithm called OASIS. The algorithm employs a dynamic programming A*-search driven by a suffix-tree index built on the input data set. We experimentally evaluate OASIS and demonstrate that for an important class of searches, in which the query sequences are short, OASIS is more than an order of magnitude faster than S-W. In addition, the speed of OASIS is comparable to that of BLAST. Furthermore, OASIS returns results in decreasing order of matching score, making it possible to use OASIS in an online setting. Consequently, we believe that it may now be practically feasible to query large biological sequence data sets using an accurate local-alignment search algorithm.
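For orientation, the exhaustive baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of Smith-Waterman local-alignment scoring, not the OASIS algorithm itself. The scoring values (match, mismatch, and a linear gap penalty) and the function names are assumptions chosen for the example; the abstract does not specify a scoring scheme.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the best local-alignment score between query and target.

    Scoring parameters are illustrative assumptions (linear gap penalty).
    Runs in O(len(query) * len(target)) time, which is why scanning a
    whole database this way is expensive.
    """
    prev = [0] * (len(target) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + sub,  # align query[i-1] with target[j-1]
                prev[j] + gap,      # gap in the target
                curr[j - 1] + gap,  # gap in the query
            )
            best = max(best, curr[j])
        prev = curr
    return best


def rank_targets(query, targets):
    """Score every target and return (score, target) pairs, best first.

    Hypothetical helper: it must score all targets before it can return
    the first result, unlike a search that emits hits in score order.
    """
    scored = [(smith_waterman_score(query, t), t) for t in targets]
    scored.sort(key=lambda pair: -pair[0])
    return scored


if __name__ == "__main__":
    for score, target in rank_targets("ACGT", ["TTTT", "GGACGTGG"]):
        print(score, target)
```

Note that this baseline has to finish scoring every target before it can report the best one, whereas the abstract states that OASIS returns results in decreasing order of score as the search proceeds.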