An efficient DNA sequence searching method using position specific weighting scheme

Authors:
Woo-Cheol Kim;Sanghyun Park;Jung-Im Won;Sang-Wook Kim;Jee-Hee Yoon
Affiliations:
-;-;Department of Computer Science, Yonsei University, Korea;College of Information and Communications, Hanyang University, Korea;Division of Information Engineering and Telecommunications, Hallym University, Korea
Venue:
Journal of Information Science
Year:
2006

Abstract

Exact match, wildcard match, and k-mismatch queries are widely used in molecular biology applications, including searches of ESTs (Expressed Sequence Tags) and DNA transcription factors. In this paper, we propose an efficient indexing and query processing mechanism for such queries. Our indexing method places a sliding window at every possible position of a DNA sequence and extracts a signature from each window based on the occurrence frequency of each nucleotide. It then stores the set of signatures in a multi-dimensional index such as the R*-tree. By assigning a weight to each position within a window, it also prevents signatures from concentrating around a few spots in the index space. Our query processing method converts a query sequence into a multi-dimensional rectangle and searches the index for the signatures that overlap the rectangle. Experiments with real biological data sets show that the proposed approach is at least 4.4 times, 2.1 times, and several orders of magnitude faster than the previous approach for exact match, wildcard match, and k-mismatch queries, respectively.
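
The indexing and query steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions, because the abstract does not specify them: the weight of position i is taken to be i + 1, the wildcard symbol is taken to be N, the query length is taken to equal the window size, and a linear scan over the stored signatures stands in for the R*-tree. All function names are hypothetical.

```python
from typing import List, Tuple

ALPHABET = "ACGT"
Signature = Tuple[int, ...]


def position_weights(w: int) -> List[int]:
    # Assumed weighting scheme: position i (0-based) gets weight i + 1.
    return [i + 1 for i in range(w)]


def signature(window: str, weights: List[int]) -> Signature:
    # One dimension per nucleotide: the sum of the weights of the
    # positions at which that nucleotide occurs in the window.
    return tuple(
        sum(wt for ch, wt in zip(window, weights) if ch == c) for c in ALPHABET
    )


def build_index(seq: str, w: int) -> List[Tuple[Signature, int]]:
    # Slide a window over every position and store (signature, offset).
    # A plain list stands in for the multi-dimensional index.
    weights = position_weights(w)
    return [(signature(seq[i:i + w], weights), i) for i in range(len(seq) - w + 1)]


def query_rectangle(query: str, weights: List[int],
                    wildcard: str = "N") -> Tuple[Signature, Signature]:
    # Each wildcard position may hold any nucleotide, so its weight is
    # added to the upper bound of every dimension.
    slack = sum(wt for ch, wt in zip(query, weights) if ch == wildcard)
    lows = signature(query, weights)  # wildcards contribute nothing here
    highs = tuple(low + slack for low in lows)
    return lows, highs


def search(seq: str, index: List[Tuple[Signature, int]], query: str,
           wildcard: str = "N") -> List[int]:
    w = len(query)  # assumed equal to the window size used for the index
    lows, highs = query_rectangle(query, position_weights(w), wildcard)
    hits = []
    for sig, pos in index:
        # Filter step: keep signatures that fall inside the rectangle.
        if all(lo <= s <= hi for lo, s, hi in zip(lows, sig, highs)):
            # Verification step: discard false positives.
            window = seq[pos:pos + w]
            if all(q == wildcard or q == c for q, c in zip(query, window)):
                hits.append(pos)
    return hits


if __name__ == "__main__":
    seq = "ACGTACGTTACG"
    index = build_index(seq, 4)
    print(search(seq, index, "ACGT"))  # exact match
    print(search(seq, index, "TNCG"))  # wildcard match
```

With plain occurrence counts, the windows AACC and CCAA would share the signature (2, 2, 0, 0). Under the assumed weights they map to (3, 7, 0, 0) and (7, 3, 0, 0), which shows how position weighting can spread signatures across the index space.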