A machine learning information retrieval approach to protein fold recognition

Authors:
Jianlin Cheng;Pierre Baldi
Affiliations:
Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California Irvine, CA, USA;Institute for Genomics and Bioinformatics, School of Information and Computer Sciences, University of California Irvine, CA, USA
Venue:
Bioinformatics
Year:
2006

Abstract

Motivation: Recognizing proteins that have similar tertiary structure is the key step of template-based protein structure prediction methods. Traditionally, a variety of alignment methods are used to identify similar folds, based on sequence similarity and sequence-structure compatibility. Although these methods are complementary, their integration has not been thoroughly exploited. Statistical machine learning methods provide tools for integrating multiple features, but so far these methods have been used primarily for protein and fold classification rather than for the retrieval problem of fold recognition: finding a proper template for a given query protein.

Results: Here we present a two-stage machine learning, information retrieval approach to fold recognition. First, we use alignment methods to derive pairwise similarity features for query-template protein pairs. We also use global profile-profile alignments in combination with predicted secondary structure, relative solvent accessibility, contact map and beta-strand pairing to extract pairwise structural compatibility features. Second, we apply support vector machines to these features to predict the structural relevance (i.e. in the same fold or not) of the query-template pairs. For each query, the continuous relevance scores are used to rank the templates. The FOLDpro approach is modular, scalable and effective. Compared with 11 other fold recognition methods, FOLDpro yields the best results in almost all standard categories on a comprehensive benchmark dataset. Using predictions of the top-ranked template, the sensitivity is ∼85, 56 and 27% at the family, superfamily and fold levels, respectively. Using the 5 top-ranked templates, the sensitivity increases to 90, 70 and 48%.

Availability: The FOLDpro server is available with the SCRATCH suite through http://www.igb.uci.edu/servers/psss.html.

Contact: pfbaldi@ics.uci.edu

Supplementary information: Supplementary data are available at http://mine5.ics.uci.edu:1026/gain.html
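
The retrieval step described under Results (score each query-template pair, then rank the templates for each query) can be illustrated with a minimal sketch. This is not the FOLDpro implementation. The toy pairs, the feature values, and the names `PAIRS`, `WEIGHTS`, `BIAS`, `relevance_score`, `rank_templates` and `top_k_sensitivity` are hypothetical, and a fixed linear decision function stands in for a trained support vector machine. Sensitivity is assumed here to mean the fraction of queries with at least one correct template among the top k.

```python
from collections import defaultdict

# Hypothetical toy data: (query, template, pairwise features, same-fold label).
# The feature values are invented for illustration only.
PAIRS = [
    ("q1", "t1", [0.9, 0.8], True),
    ("q1", "t2", [0.2, 0.1], False),
    ("q1", "t3", [0.4, 0.6], False),
    ("q2", "t1", [0.3, 0.2], False),
    ("q2", "t2", [0.5, 0.4], True),
    ("q2", "t3", [0.6, 0.7], False),
]

# Stand-in for a trained scorer: a linear decision function w.x + b.
# These weights are assumptions, not values from the paper.
WEIGHTS = [1.0, 1.0]
BIAS = -1.0


def relevance_score(features):
    """Return a continuous relevance score for one query-template pair."""
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS


def rank_templates(pairs):
    """Group pairs by query and sort each query's templates by descending score."""
    per_query = defaultdict(list)
    for query, template, features, relevant in pairs:
        per_query[query].append((relevance_score(features), template, relevant))
    return {
        query: sorted(items, key=lambda item: item[0], reverse=True)
        for query, items in per_query.items()
    }


def top_k_sensitivity(rankings, k):
    """Fraction of queries with at least one relevant template in the top k."""
    hits = sum(
        1
        for ranked in rankings.values()
        if any(relevant for _, _, relevant in ranked[:k])
    )
    return hits / len(rankings)


if __name__ == "__main__":
    rankings = rank_templates(PAIRS)
    for query, ranked in rankings.items():
        print(query, [template for _, template, _ in ranked])
    print("top-1 sensitivity:", top_k_sensitivity(rankings, 1))
    print("top-2 sensitivity:", top_k_sensitivity(rankings, 2))
```

In this toy setting the correct template for `q2` is ranked second, so sensitivity rises when two templates are considered instead of one, which mirrors the gain reported above when moving from the top-ranked template to the 5 top-ranked templates.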