Peptide programs: applying fragment programs to protein classification
Proceedings of the 2nd international workshop on Data and text mining in bioinformatics
A Study of Hierarchical and Flat Classification of Proteins
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Two-phase prediction of protein functions from biological literature based on Gini-Index
Proceedings of the 5th International Conference on Ubiquitous Information Management and Communication
Computers in Biology and Medicine
Efficient prediction algorithms for binary decomposition techniques
Data Mining and Knowledge Discovery
2D similarity kernels for biological sequence classification
Proceedings of the 11th International Workshop on Data Mining in Bioinformatics
Biological Sequence Classification with Multivariate String Kernels
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Hi-index | 0.00 |
Predicting a protein's structural class from its amino acid sequence is a fundamental problem in computational biology. Recent machine learning work in this domain has focused on developing new input space representations for protein sequences, that is, string kernels, some of which give state-of-the-art performance for the binary prediction task of discriminating between one class and all the others. However, the underlying protein classification problem is in fact a huge multi-class problem, with over 1000 protein folds and even more structural subcategories organized into a hierarchy. To handle this challenging many-class problem while taking advantage of progress on the binary problem, we introduce an adaptive code approach in the output space of one-vs-the-rest prediction scores. Specifically, we use a ranking perceptron algorithm to learn a weighting of binary classifiers that improves multi-class prediction with respect to a fixed set of output codes. We use a cross-validation set-up to generate output vectors for training, and we define codes that capture information about the protein structural hierarchy. Our code weighting approach significantly improves on the standard one-vs-all method for two difficult multi-class protein classification problems: remote homology detection and fold recognition. Our algorithm also outperforms a previous code learning approach due to Crammer and Singer, trained here using a perceptron, when the dimension of the code vectors is high and the number of classes is large. Finally, we compare against PSI-BLAST, one of the most widely used methods in protein sequence analysis, and find that our method strongly outperforms it on every structure classification problem that we consider. Supplementary data and source code are available at http://www.cs.columbia.edu/compbio/adaptive.