Piers: an efficient model for similarity search in DNA sequence databases

Authors:
Xia Cao;Shuai Cheng Li;Beng Chin Ooi;Anthony K. H. Tung
Affiliations:
National University of Singapore, Singapore;National University of Singapore, Singapore;National University of Singapore, Singapore;National University of Singapore, Singapore
Venue:
ACM SIGMOD Record
Year:
2004

Citing 11
Cited 5

Suffix arrays: a new method for on-line string searches

SIAM Journal on Computing
Fast subsequence matching in time-series databases

SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
Algorithms on strings, trees, and sequences: computer science and computational biology

Algorithms on strings, trees, and sequences: computer science and computational biology
q-gram based database searching using a suffix array (QUASAR)

RECOMB '99 Proceedings of the third annual international conference on Computational molecular biology
Approximate nearest neighbors and sequence comparison with block operations

STOC '00 Proceedings of the thirty-second annual ACM symposium on Theory of computing
Indexing and Retrieval for Genomic Databases

IEEE Transactions on Knowledge and Data Engineering
Efficient Index Structures for String Databases

Proceedings of the 27th International Conference on Very Large Data Bases
A Database Index to Large Biological Sequences

Proceedings of the 27th International Conference on Very Large Data Bases
Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

BIBE '03 Proceedings of the 3rd IEEE Symposium on BioInformatics and BioEngineering
The ed-tree: an index for large DNA sequence databases

SSDBM '03 Proceedings of the 15th International Conference on Scientific and Statistical Database Management
OASIS: an online and accurate technique for local-alignment searches on biological sequences

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29

ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis

IEEE Transactions on Parallel and Distributed Systems
Survey on index based homology search algorithms

The Journal of Supercomputing
Brief communication: An efficient similarity search based on indexing in large DNA databases

Computational Biology and Chemistry
Indexing DNA sequences using q-grams

DASFAA'05 Proceedings of the 10th international conference on Database Systems for Advanced Applications
ALAE: accelerating local alignment with affine gap exactly in biosequence databases

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

Growing interest in genomic research has resulted in the creation of huge biological sequence databases. In this paper, we present a hash-based pier model for efficient homology search in large DNA sequence databases. In our model, only certain segments in the databases called 'piers' need to be accessed during searches as opposite to other approaches which require a full scan on the biological sequence database. To further improve the search efficiency, the piers are stored in a specially designed hash table which helps to avoid expensive alignment operation. The has table is small enough to reside in main memory, hence avoiding I/O in the search steps. We show theoretically and empirically that the proposed approach can efficiently detect biological sequences that are similar to a query sequence with very high sensitivity.