Conditional graphical models for protein structure prediction

  • Authors:
  • Jaime Carbonell;John Lafferty;Eric P. Xing;Vanathi Gopalakrishna;Yan Liu

  • Affiliations:
  • Language Technologies Institute, School of Computer Science, Carnegie Mellon University;-;-;University of Pittsburgh;Language Technologies Institute, School of Computer Science, Carnegie Mellon University

  • Venue:
  • Conditional graphical models for protein structure prediction
  • Year:
  • 2006

Abstract

Protein structures play key roles in determining protein functions, activities, stability and subcellular localization. However, determining the structures of millions of proteins experimentally with current techniques is extremely time-consuming and expensive. For instance, it may take months to crystallize a single protein. In this thesis, we design computational methods to predict protein structures from their sequences in silico. In particular, we focus on predicting structural topology (as opposed to the specific coordinates of each atom) at different levels of the protein structure hierarchy. Specifically, given a protein sequence, our goal is to predict its secondary structure elements, how they arrange themselves in three-dimensional space, and how multiple chains associate with each other to form one stable structure. In other words, we strive to predict secondary, tertiary and quaternary protein structures from primary sequences and biophysical constraints.
In structural biology, traditional approaches to protein structure prediction are based on sequence similarity. They use string matching algorithms or generate probabilistic profile scores to find the most similar sequences in the protein database. These methods work well for simple structures with strongly conserved sequences, but fail when the structures are complex, with many long-range interactions such as hydrogen bonds and disulfide bonds among amino acids that are distant in sequence order. Moreover, evolution often preserves structures without preserving sequences. Hence structure prediction cannot rely on sequence homology alone. These cases necessitate a more expressive model that captures the structural properties of proteins, and developing a family of such predictive models is therefore the core of this dissertation.
A new type of undirected graphical model is built on protein structure graphs, whose nodes represent the state of either a residue or a secondary structure element and whose edges represent interactions (e.g. bonds), either between nodes adjacent in sequence order or among nodes distant in the primary sequence that fold back to establish proximity in 3D space. A discriminative learning approach is defined over these graphs, in which the conditional probability of the states given the observed sequences is defined directly as an exponential function of local and topological features, without any assumptions about the data generation process. Thus our framework is able to capture the structural properties of proteins directly, including any overlapping or long-range interaction features. Within this framework, we develop conditional random fields and kernel conditional random fields for protein secondary structure prediction; we extend these to create segmentation conditional random fields and chain graph models for tertiary fold recognition, and linked segmentation conditional random fields for quaternary fold prediction. These extensions are new contributions to machine learning: they enable direct modeling of long-distance interactions and allow conditional random fields to scale up to much larger and more complex structure prediction tasks. With respect to computational biology, we contribute a novel and comprehensive paradigm for modeling and predicting secondary, super-secondary, tertiary and quaternary protein structures, surpassing the state of the art in both expressive power and predictive accuracy, as demonstrated in our suite of experiments. Moreover, we predict a large number of previously unresolved beta-helical structures from the Swiss-Prot database, three of which have subsequently been confirmed by X-ray crystallography and none of which has been disconfirmed.
We hope that this work may shed light on the fundamental processes in protein structure modeling and may enable better processes for synthetic large-molecule drug design.
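As a rough illustration of the conditional formulation described in the abstract, the sketch below scores a labeling of a short residue sequence as a sum of weighted local and adjacent-pair features and normalizes the exponentiated score over all labelings. It is a plain linear-chain conditional random field and not the thesis's implementation: the three-state label set, the residue string, the weight tables and all function names are assumptions made for this example, and the long-range edges that the segmentation models add are left out.

```python
import itertools
import math

# Illustrative label set: helix, strand, coil.
STATES = ["H", "E", "C"]

# Hypothetical hand-picked weights (not learned, not from the thesis).
# Node features: (residue, state). Edge features: (previous state, state).
EMISSION = {
    ("A", "H"): 1.0, ("L", "H"): 0.8,
    ("V", "E"): 1.0, ("I", "E"): 0.7,
    ("G", "C"): 1.0, ("P", "C"): 1.2,
}
TRANSITION = {
    ("H", "H"): 0.9, ("E", "E"): 0.9, ("C", "C"): 0.3,
    ("H", "E"): -1.0, ("E", "H"): -1.0,
}


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def score(residues, labels):
    """Unnormalized log-score: weighted node features plus chain-edge features."""
    node = sum(EMISSION.get((r, y), 0.0) for r, y in zip(residues, labels))
    edge = sum(TRANSITION.get(pair, 0.0) for pair in zip(labels, labels[1:]))
    return node + edge


def log_partition(residues):
    """log Z(x), computed with the forward recursion in log space."""
    alpha = {y: EMISSION.get((residues[0], y), 0.0) for y in STATES}
    for r in residues[1:]:
        alpha = {
            y: EMISSION.get((r, y), 0.0) + log_sum_exp(
                [alpha[p] + TRANSITION.get((p, y), 0.0) for p in STATES]
            )
            for y in STATES
        }
    return log_sum_exp(list(alpha.values()))


def brute_force_log_partition(residues):
    """log Z(x) by enumerating every labeling; only feasible for short inputs."""
    labelings = itertools.product(STATES, repeat=len(residues))
    return log_sum_exp([score(residues, ys) for ys in labelings])


def conditional_probability(residues, labels):
    """p(labels | residues) = exp(score(residues, labels)) / Z(residues)."""
    return math.exp(score(residues, labels) - log_partition(residues))


if __name__ == "__main__":
    x = "ALVIGP"
    print(conditional_probability(x, "HHEECC"))
    print(log_partition(x), brute_force_log_partition(x))
```

The two printed log-partition values should agree, which is a quick check that the forward recursion matches exhaustive enumeration. Modeling long-range interactions would require edges between non-adjacent positions, at which point this simple chain recursion no longer applies.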