Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases

Authors:
Ozgur Ozturk;Hakan Ferhatosmanoglu
Affiliations:
-;-
Venue:
BIBE '03 Proceedings of the 3rd IEEE Symposium on BioInformatics and BioEngineering
Year:
2003

Citing: 0
Cited by: 9

CoMRI: A Compressed Multi-Resolution Index Structure for Sequence Similarity Queries
CSB '03 Proceedings of the IEEE Computer Society Conference on Bioinformatics

Sentence completion
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval

Piers: an efficient model for similarity search in DNA sequence databases
ACM SIGMOD Record

Survey on index based homology search algorithms
The Journal of Supercomputing

Volumetric-based detection scheme for multi-antenna FH/MFSK systems in the presence of multi-follower jamming
Signal Processing

Brief communication: An efficient similarity search based on indexing in large DNA databases
Computational Biology and Chemistry

Indexing methods for approximate dictionary searching: Comparative analysis
Journal of Experimental Algorithmics (JEA)

Filtering bio-sequence based on sequence descriptor
BioDM'06 Proceedings of the 2006 international conference on Data Mining for Biomedical Applications

Indexing DNA sequences using q-grams
DASFAA'05 Proceedings of the 10th international conference on Database Systems for Advanced Applications

Abstract

We present a multi-dimensional indexing approach for fast sequence similarity search in DNA and protein databases. In particular, we propose effective transformations of subsequences into numerical vector domains and build efficient index structures on the transformed vectors. We then define distance functions in the transformed domain and examine the properties of these functions. We experimentally compared their (a) approximation quality for k-Nearest Neighbor (k-NN) queries, (b) pruning ability, and (c) approximation quality for ε-range queries. Results for k-NN queries, which we present here, show that our proposed distances FD2 and WD2 (i.e., Frequency and Wavelet Distance functions for 2-grams) perform significantly better than the others. We then develop effective index structures, based on R-trees and scalar quantization, on top of the transformed vectors and distance functions. Promising results from the experiments on real biosequence data sets are presented.
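
The idea of transforming subsequences into numerical vectors can be illustrated with a minimal Python sketch. It assumes a DNA alphabet and maps a subsequence to its 2-gram frequency vector (16 dimensions). The function names are hypothetical, and the plain L1 distance used here is a stand-in: the abstract does not define FD2 or WD2, so this is not the paper's distance function. The lower bound relies on the standard q-gram argument that a single edit operation changes the L1 distance between q-gram count vectors by at most 2q.

```python
from collections import Counter
from itertools import product

DNA = "ACGT"  # assumed alphabet; a protein alphabet would give 20**q dimensions


def qgram_vector(seq, q=2, alphabet=DNA):
    """Map a sequence to its q-gram frequency vector (hypothetical helper)."""
    counts = Counter(seq[i:i + q] for i in range(len(seq) - q + 1))
    # Fixed lexicographic order over all |alphabet|**q possible q-grams.
    return [counts["".join(g)] for g in product(alphabet, repeat=q)]


def l1_distance(u, v):
    """L1 distance between two frequency vectors (stand-in, not the paper's FD2)."""
    return sum(abs(a - b) for a, b in zip(u, v))


def edit_distance_lower_bound(s, t, q=2):
    """Lower bound on edit distance from the q-gram vectors.

    One insertion, deletion, or substitution touches at most q q-grams,
    so it moves the L1 distance by at most 2q.
    """
    d = l1_distance(qgram_vector(s, q), qgram_vector(t, q))
    return -(-d // (2 * q))  # ceiling division


def window_vectors(seq, w, q=2):
    """Frequency vectors of all length-w windows, i.e. the points to be indexed."""
    return [qgram_vector(seq[i:i + w], q) for i in range(len(seq) - w + 1)]
```

A filter built this way can discard any database window whose lower bound already exceeds the query range, so the expensive alignment only runs on the survivors. The vectors from `window_vectors` are what a multi-dimensional index such as an R-tree would store.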