Identification of fundamental building blocks in protein sequences using statistical association measures

Authors:
Deborah Weisser;Judith Klein-Seetharaman
Affiliations:
Carnegie Mellon University, Pittsburgh, PA;University of Pittsburgh, Pittsburgh, PA
Venue:
Proceedings of the 2004 ACM symposium on Applied computing
Year:
2004

Abstract

Protein sequence data is abundant, yet derivation of structural features from sequence alone is generally restricted to prediction of domain architecture, secondary structure elements and motifs. Precise feature boundaries cannot be determined reliably, and it is unknown to what extent these features constitute fundamental building blocks of protein sequences, a question with particular relevance to protein folding. Here we propose a statistical approach that uses mutual information, a measure of association, to predict feature boundaries. In this approach, proteins are viewed as strings of adjacent, non-overlapping features, where each feature is a subsequence of the protein and the union of the features is the entire protein. Mutual information values are measured between nearby amino acids along sequences, and low values indicate feature boundaries. These boundaries are then predicted using a flexible partitioning algorithm. The algorithms presented in this paper were tested on the GPCR protein family and subfamilies. A comparison with segment boundaries implied indirectly from secondary structure prediction and expert knowledge demonstrates that the algorithm can be used to statistically predict feature positions in protein sequences generically, without assumptions about the type of feature to be detected. The data used and the algorithms presented in this paper are available at flan.blm.cs.cmu.edu.
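
The idea of scoring association between nearby residues and cutting where it is low can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes pointwise mutual information estimated from pair counts in a small corpus, and it substitutes a simple "local minimum below a threshold" rule for the flexible partitioning algorithm mentioned in the abstract. All function names (`pmi_table`, `pmi_profile`, `predict_boundaries`, `split_features`) and the `gap` and `threshold` parameters are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def pmi_table(sequences, gap=1):
    """Estimate pointwise mutual information for residue pairs `gap` apart.

    Assumption: probabilities are plain relative frequencies over all
    (residue, residue) pairs observed in `sequences`.
    """
    pair_counts = Counter()
    left_counts = Counter()
    right_counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - gap):
            a, b = seq[i], seq[i + gap]
            pair_counts[(a, b)] += 1
            left_counts[a] += 1
            right_counts[b] += 1
    total = sum(pair_counts.values())
    table = {}
    for (a, b), n in pair_counts.items():
        # log2(p(a,b) / (p(a) * p(b))) rewritten in terms of raw counts
        table[(a, b)] = math.log2(n * total / (left_counts[a] * right_counts[b]))
    return table


def pmi_profile(seq, table, gap=1, unseen=float("-inf")):
    """Score each position i by the association between seq[i] and seq[i + gap].

    Pairs never seen in the corpus get the `unseen` score (an assumption).
    """
    return [table.get((seq[i], seq[i + gap]), unseen)
            for i in range(len(seq) - gap)]


def predict_boundaries(profile, threshold):
    """Return cut positions where the profile has a local minimum below `threshold`.

    A returned value c means the boundary lies between residues c - 1 and c
    (0-based), which matches a profile computed with gap=1.
    """
    cuts = []
    for i, value in enumerate(profile):
        left = profile[i - 1] if i > 0 else float("inf")
        right = profile[i + 1] if i + 1 < len(profile) else float("inf")
        if value < threshold and value <= left and value < right:
            cuts.append(i + 1)
    return cuts


def split_features(seq, cuts):
    """Split `seq` into adjacent, non-overlapping pieces that cover it entirely."""
    edges = [0] + list(cuts) + [len(seq)]
    return [seq[start:end] for start, end in zip(edges, edges[1:])]
```

On a toy corpus such as `["ABAB", "CDCD", "ABCD"]`, the pair `B`-`C` is more weakly associated than `A`-`B` or `C`-`D`, so `predict_boundaries(pmi_profile("ABCD", pmi_table(corpus)), threshold=1.5)` places a single cut after the second residue and `split_features` yields `["AB", "CD"]`. The threshold here is arbitrary; real protein data would need far larger corpora and a principled way to choose it.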