A practical method for approximate subsequence search in DNA databases

Authors:
Jung-Im Won;Sang-Kyoon Hong;Jee-Hee Yoon;Sanghyun Park;Sang-Wook Kim
Affiliations:
College of Information and Communications, Hanyang University, Korea;Division of Information Engineering and Telecommunications, Hallym University, Korea;Division of Information Engineering and Telecommunications, Hallym University, Korea;Department of Computer Science, Yonsei University, Korea;College of Information and Communications, Hanyang University, Korea
Venue:
PAKDD'07 Proceedings of the 11th Pacific-Asia conference on Advances in knowledge discovery and data mining
Year:
2007

Citing 9
Cited 0

Algorithms on strings, trees, and sequences: computer science and computational biology

Algorithms on strings, trees, and sequences: computer science and computational biology
Indexing and Retrieval for Genomic Databases

IEEE Transactions on Knowledge and Data Engineering
Database indexing for large DNA and protein sequence collections

The VLDB Journal — The International Journal on Very Large Data Bases
FLASH: A Fast Look-Up Algorithm for String Homology

Proceedings of the 1st International Conference on Intelligent Systems for Molecular Biology
Efficient Index Structures for String Databases

Proceedings of the 27th International Conference on Very Large Data Bases
BLAST++: a tool for BLASTing queries in batches

APBC '03 Proceedings of the First Asia-Pacific bioinformatics conference on Bioinformatics 2003 - Volume 19
OASIS: an online and accurate technique for local-alignment searches on biological sequences

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Practical suffix tree construction

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
A novel indexing method for efficient sequence matching in large DNA database environment

PAKDD'05 Proceedings of the 9th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper, we propose an accurate and efficient method for approximate subsequence search in large DNA databases. The proposed method basically adopts a binary trie as its primary structure and stores all the window subsequences extracted from a DNA sequence. For approximate subsequence search, it traverses the binary trie in a breadth-first fashion and retrieves all the matched subsequences from the traversed path within the trie by a dynamic programming technique. However, the proposed method stores only window subsequences of the pre-determined length, and thus suffers from large post-processing time in case of long query sequences. To overcome this problem, we divide a query sequence into shorter pieces, perform searching for those subsequences, and then merge their results.