Protein structures play key roles in determining protein functions, activities, stability and subcellular localization. However, experimentally determining the structures of millions of proteins with current techniques is extremely time-consuming and expensive. For instance, it may take months to crystallize a single protein. In this thesis, we design computational methods to predict protein structures from their sequences in silico. In particular, we focus on predicting structural topology (as opposed to the specific coordinates of each atom) at different levels of the protein structure hierarchy. Specifically, given a protein sequence, our goal is to predict its secondary structure elements, how they arrange themselves in three-dimensional space, and how multiple chains associate with each other to form one stable structure. In other words, we strive to predict secondary, tertiary and quaternary protein structures from primary sequences and biophysical constraints.
In structural biology, traditional approaches to protein structure prediction are based on sequence similarity. They use string matching algorithms or generate probabilistic profile scores to find the most similar sequences in the protein database. These methods work well for simple structures with strongly conserved sequences, but fail when the structures are complex, with many long-range interactions such as hydrogen and disulfide bonds among amino acids that are distant in sequence order. Moreover, evolution often preserves structures without preserving sequences. Hence structure prediction cannot rely on sequence homology alone. These cases call for a more expressive model that captures the structural properties of proteins, and developing a family of such predictive models is the core of this dissertation.
We build a new type of undirected graphical model based on protein structure graphs, whose nodes represent the states of either residues or secondary structure elements and whose edges represent interactions (e.g., bonds) either between nodes adjacent in sequence order or between nodes that are distant in the primary sequence but fold back into proximity in 3D space. A discriminative learning approach is defined over these graphs, in which the conditional probability of the states given the observed sequence is defined directly as an exponential function of local and topological features, without any assumptions about the data generation process. Our framework is thus able to capture the structural properties of proteins directly, including any overlapping or long-range interaction features. Within this framework, we develop conditional random fields and kernel conditional random fields for protein secondary structure prediction; we extend these to create segmentation conditional random fields and a chain graph model for tertiary fold recognition, and linked segmentation conditional random fields for quaternary fold prediction. These extensions are new contributions to machine learning: they enable direct modeling of long-distance interactions and allow conditional random fields to scale up to much larger and more complex structural prediction tasks. With respect to computational biology, we contribute a novel and comprehensive paradigm for modeling and predicting secondary, super-secondary, tertiary and quaternary protein structures, surpassing the state of the art in both expressive power and predictive accuracy, as demonstrated in our suite of experiments. Moreover, we predict a large number of previously unresolved beta-helical structures from the Swiss-Prot database, three of which have since been confirmed by X-ray crystallography and none of which has been disconfirmed.
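To make the idea of a conditional probability defined as an exponential function of local and topological features more concrete, the following is a minimal sketch and not the models developed in the thesis. It assumes a toy three-state label set, hand-picked (hypothetical) feature weights, and a graph given as a list of index pairs; the names `node_score`, `edge_score`, `total_score` and `conditional_prob` are illustrative only. Normalization is done by brute-force enumeration, which is feasible only for very short sequences.

```python
import itertools
import math

# Illustrative label set: helix, strand, coil (an assumption, not from the thesis).
STATES = ("H", "E", "C")


def node_score(residue, state):
    """Local feature: hypothetical preference of a residue for a state."""
    prefs = {("A", "H"): 1.0, ("V", "E"): 1.0, ("G", "C"): 1.0}
    return prefs.get((residue, state), 0.0)


def edge_score(state_i, state_j, long_range):
    """Topological feature on an edge; weights are hypothetical."""
    if long_range:
        # Reward two strand positions that are paired across the sequence.
        return 1.5 if state_i == state_j == "E" else 0.0
    # Reward neighbouring positions that share a state.
    return 0.5 if state_i == state_j else 0.0


def total_score(seq, labels, edges):
    """Sum of node and edge feature scores for one labeling of the graph."""
    score = sum(node_score(r, y) for r, y in zip(seq, labels))
    for i, j in edges:
        score += edge_score(labels[i], labels[j], long_range=abs(i - j) > 1)
    return score


def conditional_prob(seq, labels, edges):
    """P(labels | seq) = exp(score) / Z, with Z summed over all labelings."""
    z = sum(
        math.exp(total_score(seq, candidate, edges))
        for candidate in itertools.product(STATES, repeat=len(seq))
    )
    return math.exp(total_score(seq, labels, edges)) / z


if __name__ == "__main__":
    seq = "VAGAV"
    chain = [(i, i + 1) for i in range(len(seq) - 1)]
    with_contact = chain + [(0, 4)]  # one long-range edge between the ends
    labels = ("E", "H", "C", "H", "E")
    print(conditional_prob(seq, labels, chain))
    print(conditional_prob(seq, labels, with_contact))
```

In this toy setting, adding the long-range edge between positions 0 and 4 raises the probability of labelings that place a strand at both ends, which is the kind of dependency a purely chain-structured model cannot express.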
We hope that this work may shed light on the fundamental processes in protein structure modeling and may enable better processes for synthetic large-molecule drug design.