Identification of fundamental building blocks in protein sequences using statistical association measures

  • Authors:
  • Deborah Weisser; Judith Klein-Seetharaman

  • Affiliations:
  • Carnegie Mellon University, Pittsburgh, PA; University of Pittsburgh, Pittsburgh, PA

  • Venue:
  • Proceedings of the 2004 ACM symposium on Applied computing
  • Year:
  • 2004

Abstract

Protein sequence data is abundant, yet derivation of structural features from sequence alone is generally restricted to prediction of domain architecture, secondary structure elements and motifs. Precise feature boundaries cannot be determined reliably, and it is unknown to what extent these features constitute fundamental building blocks of protein sequences, a question with particular relevance to protein folding. Here we propose a statistical approach that uses mutual information, a measure of association, to predict feature boundaries. In this approach, proteins are viewed as strings of adjacent, non-overlapping features, where each feature is a subsequence of the protein and the union of the features is the entire protein. Mutual information values are measured between nearby amino acids along sequences, and low values indicate feature boundaries. These boundaries are then predicted using a flexible partitioning algorithm. The algorithms presented in this paper were tested on the GPCR protein family and its subfamilies. A comparison with segment boundaries implied indirectly by secondary structure prediction and expert knowledge demonstrates that the algorithm can be used to statistically predict feature positions in protein sequences generically, without assumptions about the feature type to be detected. The data used and the algorithms presented in this paper are available at flan.blm.cs.cmu.edu.
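
The idea of treating low association between neighbouring residues as a boundary signal can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify how mutual information is estimated or how the flexible partitioning algorithm works. The sketch assumes pointwise mutual information (PMI) between directly adjacent residues, estimated from pair counts over a set of sequences, and a simple fixed threshold in place of the partitioning step. The function names `pair_pmi`, `candidate_boundaries` and `split_at`, and the `threshold` parameter, are hypothetical.

```python
import math
from collections import Counter


def pair_pmi(sequences):
    """Estimate PMI (in bits) for each adjacent residue pair seen in `sequences`.

    Assumption: association is measured only between direct neighbours,
    using PMI(a, b) = log2( p(a, b) / (p_left(a) * p_right(b)) ).
    """
    left, right, pairs = Counter(), Counter(), Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
            left[a] += 1   # a observed as the left member of a pair
            right[b] += 1  # b observed as the right member of a pair
    total = sum(pairs.values())
    return {
        (a, b): math.log2(n * total / (left[a] * right[b]))
        for (a, b), n in pairs.items()
    }


def candidate_boundaries(seq, pmi, threshold=0.0):
    """Return cut indices i where the pair (seq[i-1], seq[i]) scores below `threshold`.

    Pairs never seen in the training sequences are treated as minimally
    associated (an assumption of this sketch), so they always produce a cut.
    """
    cuts = []
    for i in range(1, len(seq)):
        score = pmi.get((seq[i - 1], seq[i]), float("-inf"))
        if score < threshold:
            cuts.append(i)
    return cuts


def split_at(seq, cuts):
    """Split `seq` into adjacent, non-overlapping segments that cover it entirely."""
    bounds = [0, *cuts, len(seq)]
    return [seq[start:end] for start, end in zip(bounds, bounds[1:])]
```

Calling `split_at(seq, candidate_boundaries(seq, pair_pmi(training_sequences)))` yields segments that concatenate back to `seq`, matching the abstract's view of a protein as a string of adjacent, non-overlapping features. A fixed threshold is the crudest possible choice; it stands in for the partitioning step only to keep the sketch short.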