CSI: clustered segment indexing for efficient approximate searching on the secondary structure of protein sequences

Authors:
Minkoo Seo;Sanghyun Park;Jung-Im Won
Affiliations:
Department of Computer Science, Yonsei University, Korea;Department of Computer Science, Yonsei University, Korea;Department of Computer Science, Yonsei University, Korea
Venue:
ISMIS'05 Proceedings of the 15th international conference on Foundations of Intelligent Systems
Year:
2005

Abstract

Approximate searching on the primary structure (i.e., the amino acid arrangement) of protein sequences is an essential part of predicting the functions and evolutionary histories of proteins. However, because evolutionarily distant proteins do not conserve their amino acid residue arrangements, approximate searching on the secondary structure of proteins is important for detecting distant homology. In this paper, we propose an indexing scheme for efficient approximate searching on the secondary structure of protein sequences that can be easily implemented in an RDBMS. By exploiting the concepts of clustering and lookahead, the proposed indexing scheme quickly processes three types of secondary structure queries: exact match, range match, and wildcard match. To evaluate the performance of the proposed method, we conducted extensive experiments using a set of actual protein sequences. In these experiments, the proposed method was up to 6.3 times faster than existing indexing methods for exact match queries, up to 3.3 times faster for range match queries, and up to 1.5 times faster for wildcard match queries.
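
The three query types named in the abstract can be illustrated with a minimal sketch. This is not the CSI index itself: it is a plain linear scan, and everything in it is an assumption made for illustration. It assumes a secondary structure is written as a string over H (helix), E (strand), and L (loop), that consecutive identical letters form one segment, and that a query is a list of (type, minimum length, maximum length) predicates, where "?" stands for any type and matches exactly one segment. The names to_segments, matches_at, and search are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

Segment = Tuple[str, int]         # (structure type, run length)
Predicate = Tuple[str, int, int]  # (type or "?", min length, max length)


def to_segments(structure: str) -> List[Segment]:
    """Run-length encode a structure string, e.g. "HHHEE" -> [("H", 3), ("E", 2)]."""
    return [(kind, len(list(run))) for kind, run in groupby(structure)]


def matches_at(segments: List[Segment], start: int, query: List[Predicate]) -> bool:
    """Check whether the query predicates match consecutive segments from `start`."""
    if start + len(query) > len(segments):
        return False
    for (seg_type, seg_len), (q_type, lo, hi) in zip(segments[start:], query):
        if q_type != "?" and q_type != seg_type:
            return False  # type mismatch (assumed: "?" accepts any type)
        if not lo <= seg_len <= hi:
            return False  # segment length outside the allowed range
    return True


def search(structure: str, query: List[Predicate]) -> List[int]:
    """Return the segment positions where the query matches (brute-force scan)."""
    segments = to_segments(structure)
    return [i for i in range(len(segments)) if matches_at(segments, i, query)]


structure = "HHHEEEEELLHHHH"  # segments: H3, E5, L2, H4

exact_hits = search(structure, [("H", 3, 3), ("E", 5, 5)])     # min == max
range_hits = search(structure, [("H", 3, 5), ("E", 2, 6)])     # length intervals
wildcard_hits = search(structure, [("?", 2, 4), ("H", 4, 4)])  # any type first
```

Under these assumptions, an exact match is the special case of a range match in which the minimum and maximum lengths coincide, and a wildcard match relaxes the type constraint. An index such as the one proposed in the paper would aim to avoid scanning every segment position as this sketch does.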