We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. Thanks to an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, running PSI-BLAST to build the profiles takes significantly longer than both the kernel computation and the SVM training. We present remote homology detection experiments based on the SCOP database, in which profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We also show how the learned SVM classifier can be used to extract "discriminative sequence motifs" (short regions of the original profile that contribute almost all the weight of the SVM classification score), and that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results are comparable to those of cluster kernels while providing much better scalability to large datasets.
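To make the idea of position-dependent mutation neighborhoods concrete, here is a minimal Python sketch. It rests on assumptions the abstract does not state: a profile is represented as a list of per-position dictionaries mapping symbols to probabilities, a k-mer belongs to the neighborhood of a profile window when its total negative log-probability under that window is below a threshold `sigma`, and the kernel is the inner product of the resulting k-mer count vectors. The function names, the two-letter alphabet, and the toy profiles are hypothetical. The sketch also enumerates every possible k-mer by brute force, so it stands in for, and does not reproduce, the efficient data structure mentioned above.

```python
import math
from collections import Counter
from itertools import product


def mutation_neighborhood(window, alphabet, sigma):
    """K-mers whose negative log-probability under the profile window is below sigma.

    Assumed definition for illustration; the abstract does not give the formula.
    """
    neighbors = []
    for kmer in product(alphabet, repeat=len(window)):
        cost = 0.0
        for column, symbol in zip(window, kmer):
            p = column.get(symbol, 0.0)
            if p <= 0.0:
                cost = math.inf  # symbol impossible at this position
                break
            cost -= math.log(p)
        if cost < sigma:
            neighbors.append("".join(kmer))
    return neighbors


def profile_features(profile, alphabet, k, sigma):
    """Count how many length-k profile windows admit each k-mer."""
    counts = Counter()
    for start in range(len(profile) - k + 1):
        counts.update(mutation_neighborhood(profile[start:start + k], alphabet, sigma))
    return counts


def profile_kernel(profile_x, profile_y, alphabet, k, sigma):
    """Inner product of the two k-mer count vectors."""
    fx = profile_features(profile_x, alphabet, k, sigma)
    fy = profile_features(profile_y, alphabet, k, sigma)
    return sum(count * fy[kmer] for kmer, count in fx.items())


# Toy profiles over a hypothetical two-letter alphabet (not real PSI-BLAST output).
ALPHABET = "AB"
profile_x = [{"A": 0.9, "B": 0.1}, {"A": 0.5, "B": 0.5}, {"A": 0.1, "B": 0.9}]
profile_y = [{"A": 0.1, "B": 0.9}, {"A": 0.1, "B": 0.9}]

features_x = profile_features(profile_x, ALPHABET, k=2, sigma=1.0)
kernel_xy = profile_kernel(profile_x, profile_y, ALPHABET, k=2, sigma=1.0)
```

In this toy case the first window of `profile_x` admits both "AA" and "AB" because its second column is uninformative, which is the sense in which matching is inexact and position-dependent. Brute-force enumeration costs time exponential in k, so it is only practical for very small k and alphabets.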