Searching on the secondary structure of protein sequences

Authors:
Laurie Hammel;Jignesh M. Patel
Affiliations:
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI;Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI
Venue:
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Year:
2002

Abstract

Despite decades of progress in database research, scientists in the life sciences community still struggle with inefficient and awkward tools for querying biological data sets. This work highlights a specific problem: searching large protein data sets based on their secondary structure. In this paper we define an intuitive query language for expressing queries on secondary structure and develop several algorithms for evaluating these queries. We implement these algorithms both in Periscope, a native system that we have built, and in a commercial ORDBMS. We show that the choice of algorithm can have a significant impact on query performance. As part of the Periscope implementation we have also developed a framework for optimizing these queries and for accurately estimating the costs of the various query evaluation plans. Our performance studies show that the proposed techniques are very efficient in the Periscope system and can provide scientists with interactive secondary structure querying options even on large protein data sets.
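To make the idea of a secondary structure query concrete, the following is a minimal sketch and not the query language or any algorithm from the paper, whose syntax the abstract does not give. It assumes that each protein's secondary structure is stored as a string over a hypothetical three-letter alphabet (`h` for helix, `e` for strand, `l` for loop) and that a query is a list of hypothetical `(type, min_len, max_len)` segment predicates, with `?` standing for any type. The names `compile_query` and `search` are illustrative only. The sketch evaluates a query by a plain scan with a regular expression, which is roughly the baseline that an index-based evaluation plan would aim to improve on.

```python
import re

# Assumed alphabet (not taken from the paper): h = helix, e = strand, l = loop.
ALPHABET = "hel"


def compile_query(segments):
    """Turn [(type, min_len, max_len), ...] into a compiled regular expression.

    Each predicate asks for a run of one structure type whose length lies in
    [min_len, max_len]; the type '?' matches any structure type.
    """
    parts = []
    for seg_type, lo, hi in segments:
        if seg_type != "?" and seg_type not in ALPHABET:
            raise ValueError(f"unknown segment type: {seg_type!r}")
        if lo < 0 or hi < lo:
            raise ValueError(f"invalid length bounds: {lo}, {hi}")
        char_class = f"[{ALPHABET}]" if seg_type == "?" else seg_type
        parts.append(f"{char_class}{{{lo},{hi}}}")
    return re.compile("".join(parts))


def search(proteins, segments):
    """Scan every structure string and report the first match per protein.

    Returns a list of (protein_id, start, end) with end exclusive.
    """
    pattern = compile_query(segments)
    hits = []
    for protein_id, structure in proteins.items():
        match = pattern.search(structure)
        if match:
            hits.append((protein_id, match.start(), match.end()))
    return hits


# Made-up toy data, for illustration only.
proteins = {
    "P1": "l" * 3 + "h" * 5 + "e" * 5 + "l" * 2,
    "P2": "h" * 10 + "l" * 4,
    "P3": "e" * 3 + "h" * 4 + "e" * 6 + "l",
}

# A helix of 3 to 5 residues immediately followed by a strand of 4 to 8.
query = [("h", 3, 5), ("e", 4, 8)]

if __name__ == "__main__":
    for hit in search(proteins, query):
        print(hit)
```

Note that this sketch does not require the first and last matched runs to be maximal, so a helix of ten residues can still satisfy a leading `("h", 3, 5)` predicate through its last five residues. Whether that is the desired semantics depends on the query language definition, which the abstract does not specify.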