A novel filtration method in biological sequence databases

Authors:
Anthony J. T. Lee; Chao-Wen Lin; Wen-Hsing Lo; Chieh-Chun Chen; Jia-Xin Chen
Affiliations:
Department of Information Management, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan, ROC (all authors)
Venue:
Pattern Recognition Letters
Year:
2007

References (7)

Fast subsequence matching in time-series databases. SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
q-gram based database searching using a suffix array (QUASAR). RECOMB '99 Proceedings of the third annual international conference on Computational molecular biology
General match: a subsequence matching method in time-series databases based on generalized windows. Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Duality-Based Subsequence Matching in Time-Series Databases. Proceedings of the 17th International Conference on Data Engineering
Filtration of String Proximity Search via Transformation. BIBE '03 Proceedings of the 3rd IEEE Symposium on BioInformatics and BioEngineering
Accelerated off-target search algorithm for siRNA. Bioinformatics
OASIS: an online and accurate technique for local-alignment searches on biological sequences. VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29

Cited by (2)

Noise Control Boundary Image Matching Using Time-Series Moving Average Transform. DEXA '08 Proceedings of the 19th international conference on Database and Expert Systems Applications
Scaling-invariant boundary image matching using time-series matching techniques. Data & Knowledge Engineering

Abstract

In this paper, we propose a new filtration method, called the Transformation-based Database Filtration (TDF) method, to screen out those data sequences in a DNA sequence database that cannot match a given query sequence. The proposed method consists of two phases. First, we divide each data sequence into several windows (blocks), transform each window into a data feature vector using the Haar wavelet transform, and store the resulting feature vectors in an index file. Second, we divide the query sequence into sliding windows and transform each of them into a query feature vector, again using the Haar wavelet transform. We then search the index file to find the candidate sequences for each query feature vector and check whether they match the query sequence using a sequence alignment algorithm. We transform the bound on the edit distance between sequences into a bound on the Manhattan distance between feature vectors. Because the Manhattan distance is much cheaper to compute than the edit distance, the proposed method can efficiently screen out impossible data sequences while guaranteeing no false negatives. Experimental results show that the proposed method outperforms the QUASAR method in terms of filtration ratio, precision, execution time, and index size. It also outperforms the YM method for long queries and for low-complexity and repetitive data.
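
The filter-then-verify idea can be illustrated with a small Python sketch. It is not the TDF implementation: the abstract does not say how a DNA window is encoded numerically or what the exact distance bound is, so everything below is an assumption made for illustration. Here a sequence is encoded as its four nucleotide counts (a hypothetical encoding), the counts are passed through an averaging/differencing Haar transform, and a pair is discarded when the Manhattan distance between the two feature vectors exceeds 2k. That threshold is safe for this particular encoding because one edit operation changes the count vectors by at most 2 in Manhattan distance, and the averaging Haar transform never increases that distance. The names `count_vector`, `feature`, and `may_match` are illustrative and do not come from the paper.

```python
ALPHABET = "ACGT"


def count_vector(seq):
    """Nucleotide counts of seq (hypothetical encoding, not the paper's)."""
    return [seq.count(ch) for ch in ALPHABET]


def haar_transform(values):
    """Averaging/differencing Haar transform of a list whose length is a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in values]
    length = n
    while length > 1:
        half = length // 2
        # Each pass replaces adjacent pairs by their mean and half-difference.
        avgs = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        diffs = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:length] = avgs + diffs
        length = half
    return out


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def feature(seq):
    """Feature vector: Haar transform of the nucleotide counts (assumption)."""
    return haar_transform(count_vector(seq))


def may_match(query, window, k):
    """Cheap filter: returns False only if the edit distance must exceed k.

    One edit changes the count vectors by at most 2 in L1, and this
    averaging Haar transform does not increase L1 distance, so an edit
    distance <= k implies a feature distance <= 2 * k.
    """
    return manhattan(feature(query), feature(window)) <= 2 * k


def edit_distance(s, t):
    """Levenshtein distance by dynamic programming (the expensive verification step)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def search(query, windows, k):
    """Filter with the Manhattan bound, then verify survivors by edit distance."""
    candidates = [w for w in windows if may_match(query, w, k)]
    return [w for w in candidates if edit_distance(query, w) <= k]
```

Calling `search("ACGTACGT", ["ACGTTCGT", "AAAAAAAA"], 1)` runs the costly `edit_distance` check only on windows that pass the filter. Count vectors discard all positional information, so this sketch filters far less sharply than a windowed index would; it is meant only to show why a distance bound in feature space rules out false negatives.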