Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure

  • Authors:
  • Darrin P. Lewis; Tony Jebara; William Stafford Noble

  • Affiliations:
  • Department of Computer Science, Columbia University, New York, NY 10027, USA; Department of Computer Science, Columbia University, New York, NY 10027, USA; Department of Genome Sciences and Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA

  • Venue:
  • Bioinformatics
  • Year:
  • 2006

Abstract

Motivation: Drawing inferences from large, heterogeneous sets of biological data requires a theoretical framework capable of representing, e.g. DNA and protein sequences, protein structures, microarray expression data, and various types of interaction networks. Recently, a class of algorithms known as kernel methods has emerged as a powerful framework for combining diverse types of data. The support vector machine (SVM) is the most popular kernel method, owing to its theoretical underpinnings and strong empirical performance on a wide variety of classification tasks. Furthermore, several recently described extensions allow the SVM to assign relative weights to individual datasets, depending upon their utility for a given classification task.

Results: In this work, we empirically investigate the performance of the SVM on the task of inferring gene functional annotations from a combination of protein sequence and structure data. Our results suggest that the SVM is quite robust to noise in the input datasets. Consequently, in the presence of only two types of data, an SVM trained on an unweighted combination of datasets performs as well as or better than a more sophisticated algorithm that assigns weights to individual data types. Indeed, for this simple case, we demonstrate empirically that no solution is significantly better than the naive, unweighted average of the two datasets. On the other hand, when multiple noisy datasets are included in the experiment, the naive approach fares worse than the weighted approach. Our results suggest that for many applications, a naive unweighted sum of kernels may be sufficient.

Availability: http://noble.gs.washington.edu/proj/seqstruct

Contact: noble@gs.washington.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
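
The abstract's central comparison is between a weighted kernel combination and the naive unweighted sum. Below is a minimal sketch (not from the paper, and using synthetic stand-in kernels rather than the study's sequence and structure kernels) of what the unweighted approach looks like in practice: two valid kernel matrices are simply added, and the resulting matrix is passed to an SVM as a precomputed kernel.

```python
# Sketch of the naive unweighted kernel-sum approach described in the abstract.
# The kernels here are synthetic stand-ins; in the paper they would be derived
# from protein sequence and structure comparisons.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def random_kernel(n, d):
    """Build a valid (positive semidefinite) kernel as a linear kernel over random features."""
    X = rng.normal(size=(n, d))
    return X @ X.T

n = 100
y = rng.integers(0, 2, size=n)        # synthetic binary functional-annotation labels
K_seq = random_kernel(n, 20)          # stand-in for a sequence-based kernel
K_struct = random_kernel(n, 20)       # stand-in for a structure-based kernel

# Unweighted combination: the sum of valid kernels is itself a valid kernel.
K_sum = K_seq + K_struct

train, test = np.arange(0, 70), np.arange(70, n)
clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K_sum[np.ix_(train, train)], y[train])
print("test accuracy:", clf.score(K_sum[np.ix_(test, train)], y[test]))
```

A weighted scheme would instead learn coefficients for each kernel (as in multiple kernel learning); the paper's finding is that with only two informative data types, such weighting offers little advantage over the plain sum shown above.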