Multi-class Protein Classification Using Adaptive Codes

Authors:
Iain Melvin;Eugene Ie;Jason Weston;William Stafford Noble;Christina Leslie
Venue:
The Journal of Machine Learning Research
Year:
2007

Abstract

Predicting a protein's structural class from its amino acid sequence is a fundamental problem in computational biology. Recent machine learning work in this domain has focused on developing new input space representations for protein sequences, that is, string kernels, some of which give state-of-the-art performance for the binary prediction task of discriminating between one class and all the others. However, the underlying protein classification problem is in fact a huge multi-class problem, with over 1000 protein folds and even more structural subcategories organized into a hierarchy. To handle this challenging many-class problem while taking advantage of progress on the binary problem, we introduce an adaptive code approach in the output space of one-vs-the-rest prediction scores. Specifically, we use a ranking perceptron algorithm to learn a weighting of binary classifiers that improves multi-class prediction with respect to a fixed set of output codes. We use a cross-validation set-up to generate output vectors for training, and we define codes that capture information about the protein structural hierarchy. Our code weighting approach significantly improves on the standard one-vs-all method for two difficult multi-class protein classification problems: remote homology detection and fold recognition. Our algorithm also outperforms a previous code learning approach due to Crammer and Singer, trained here using a perceptron, when the dimension of the code vectors is high and the number of classes is large. Finally, we compare against PSI-BLAST, one of the most widely used methods in protein sequence analysis, and find that our method strongly outperforms it on every structure classification problem that we consider. Supplementary data and source code are available at http://www.cs.columbia.edu/compbio/adaptive.
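The code-weighting idea in the abstract can be illustrated with a minimal sketch. It assumes each example is already represented by a vector of one-vs-the-rest scores and that the code matrix is fixed. The function names, the margin, the learning rate, and the plain unnormalized update are illustrative assumptions, not the authors' implementation, and the identity (one-vs-all) codes used in the usage note stand in for the hierarchy-aware codes the abstract describes.

```python
import numpy as np


def predict(W, F, codes):
    """Predict the class whose code vector best matches the weighted scores.

    W     : (d,)   learned weight per binary classifier (hypothetical name)
    F     : (n, d) one-vs-the-rest score vectors, one row per example
    codes : (k, d) fixed code vector for each of the k classes
    """
    return np.argmax((F * W) @ codes.T, axis=1)


def train_code_weights(F, y, codes, epochs=20, margin=1.0, lr=1.0):
    """Ranking-perceptron-style learning of classifier weights (a sketch).

    Starts from W = 1, which with identity codes reproduces plain one-vs-all.
    Whenever the true class does not beat its strongest rival by `margin`,
    W is moved toward the true class code and away from the rival's code.
    """
    n, d = F.shape
    W = np.ones(d)
    for _ in range(epochs):
        for i in range(n):
            scores = codes @ (W * F[i])          # one score per class
            true_score = scores[y[i]]
            others = scores.copy()
            others[y[i]] = -np.inf               # exclude the true class
            rival = int(np.argmax(others))       # highest-scoring wrong class
            if true_score - others[rival] < margin:
                W += lr * F[i] * (codes[y[i]] - codes[rival])
    return W
```

As a usage example, take two classes with identity codes and two training examples with scores `[1, 3]` (class 0) and `[-1, 3]` (class 1). The second binary classifier outputs the same score for both, so unweighted one-vs-all assigns both examples to class 1. Training with the sketch above upweights the informative first classifier, after which both examples are classified correctly. In the setting the abstract describes, the score vectors would instead come from binary classifiers evaluated in a cross-validation set-up.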