Conditional graphical models for protein structure prediction

Authors:
Jaime Carbonell;John Lafferty;Eric P. Xing;Vanathi Gopalakrishna;Yan Liu
Affiliations:
Language Technologies Institute, School of Computer Science, Carnegie Mellon University;-;-;University of Pittsburgh;Language Technologies Institute, School of Computer Science, Carnegie Mellon University
Venue:
Conditional graphical models for protein structure prediction
Year:
2006

Abstract

Protein structures play key roles in determining protein functions, activities, stability and subcellular localization. However, determining the structures of millions of proteins experimentally with current techniques is extremely time-consuming and expensive. For instance, it may take months to crystallize a single protein. In this thesis, we design computational methods to predict protein structures from their sequences in silico. In particular, we focus on predicting structural topology (as opposed to the specific coordinates of each atom) at different levels of the protein structure hierarchy. Specifically, given a protein sequence, our goal is to predict its secondary structure elements, how they arrange themselves in three-dimensional space, and how multiple chains associate with each other to form one stable structure. In other words, we strive to predict secondary, tertiary and quaternary protein structures from primary sequences and biophysical constraints.
In structural biology, traditional approaches to protein structure prediction are based on sequence similarity. They use string matching algorithms or generate probabilistic profile scores to find the most similar sequences in a protein database. These methods work well for simple structures with strongly conserved sequences, but fail when the structures are complex, with many long-range interactions such as hydrogen and disulfide bonds among amino acids distant in sequence order. Moreover, evolution often preserves structures without preserving sequences. Hence structure prediction cannot rely on sequence homology alone. These cases necessitate a more expressive model to capture the structural properties of proteins, and developing a family of such predictive models is the core of this dissertation.
We build a new type of undirected graphical model based on protein structure graphs, whose nodes represent the states of either individual residues or secondary structure elements, and whose edges represent interactions (e.g. bonds), either between nodes adjacent in sequence order or long-range interactions between nodes that are distant in the primary sequence but fold back into proximity in 3D space. A discriminative learning approach is defined over these graphs: the conditional probability of the states given the observed sequence is defined directly as an exponential function of local and topological features, without any assumptions regarding the data generation process. Our framework is therefore able to capture the structural properties of proteins directly, including any overlapping or long-range interaction features. Within this framework, we develop conditional random fields and kernel conditional random fields for protein secondary structure prediction; we extend these to create segmentation conditional random fields and a chain graph model for tertiary fold recognition, and linked segmentation conditional random fields for quaternary fold prediction. These extensions are new contributions to machine learning: they enable direct modeling of long-distance interactions and allow conditional random fields to scale up to much larger and more complex structured prediction tasks. With respect to computational biology, we contribute a novel and comprehensive paradigm for modeling and predicting secondary, super-secondary, tertiary and quaternary protein structures, surpassing the state of the art both in expressive power and predictive accuracy, as demonstrated in our suite of experiments. Moreover, we predict a large number of previously unresolved beta-helical structures from the Swiss-Prot database, three of which have subsequently been confirmed via X-ray crystallography, and none of which has been disconfirmed.
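The conditional model described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the label set, the feature keys, the weights, and the function names `score` and `conditional_probability` are all hypothetical, and the normalizer is computed by brute-force enumeration, which is only feasible for toy inputs. The sketch shows how a conditional probability over states can be written as a normalized exponential of weighted node and edge features, where edges may connect residues that are adjacent in sequence or far apart.

```python
import itertools
import math

# Illustrative label set (helix, strand, coil); an assumption for this sketch.
STATES = ("H", "E", "C")


def score(labels, sequence, edges, weights):
    """Sum of weighted features over all nodes and edges of the graph."""
    total = 0.0
    # Node features: compatibility of each residue with its assigned state.
    for residue, label in zip(sequence, labels):
        total += weights.get(("node", residue, label), 0.0)
    # Edge features: pairs of states joined by an edge, whether the two
    # positions are adjacent in sequence or far apart (long-range).
    for i, j in edges:
        total += weights.get(("edge", labels[i], labels[j]), 0.0)
    return total


def conditional_probability(labels, sequence, edges, weights):
    """P(labels | sequence) as a normalized exponential of the score.

    The normalizer sums over every possible labeling, so the cost grows
    exponentially with sequence length; this is for illustration only.
    """
    z = sum(
        math.exp(score(y, sequence, edges, weights))
        for y in itertools.product(STATES, repeat=len(sequence))
    )
    return math.exp(score(labels, sequence, edges, weights)) / z


# Toy example: a four-residue sequence with three chain edges and one
# long-range edge (0, 3). All weights are made up for illustration.
sequence = "AVGA"
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
weights = {("node", "A", "H"): 1.0, ("edge", "H", "H"): 0.5}
p_helix = conditional_probability(("H", "H", "H", "H"), sequence, edges, weights)
p_coil = conditional_probability(("C", "C", "C", "C"), sequence, edges, weights)
```

Because the long-range edge `(0, 3)` enters the score in the same way as the chain edges, adding such interactions changes only the graph, not the form of the model. Practical models of this kind need an inference method that avoids enumerating all labelings.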
We hope that this work may shed light on the fundamental processes in protein structure modeling and may enable better processes for synthetic large-molecule drug design.