Profile-Based String Kernels for Remote Homology Detection and Motif Extraction

Authors:
Rui Kuang;Eugene Ie;Ke Wang;Kai Wang;Mahira Siddiqi;Yoav Freund;Christina Leslie
Affiliations:
Columbia University;Columbia University;Columbia University;Columbia University;Columbia University;Columbia University;Columbia University
Venue:
CSB '04 Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference
Year:
2004

Abstract

We introduce novel profile-based string kernels for use with support vector machines (SVMs) for the problems of protein classification and remote homology detection. These kernels use probabilistic profiles, such as those produced by the PSI-BLAST algorithm, to define position-dependent mutation neighborhoods along protein sequences for inexact matching of k-length subsequences ("k-mers") in the data. By use of an efficient data structure, the kernels are fast to compute once the profiles have been obtained. For example, the time needed to run PSI-BLAST in order to build the profiles is significantly longer than both the kernel computation time and the SVM training time. We present remote homology detection experiments based on the SCOP database where we show that profile-based string kernels used with SVM classifiers strongly outperform all recently presented supervised SVM methods. We also show how we can use the learned SVM classifier to extract "discriminative sequence motifs" (short regions of the original profile that contribute almost all the weight of the SVM classification score) and show that these discriminative motifs correspond to meaningful structural features in the protein data. The use of PSI-BLAST profiles can be seen as a semi-supervised learning technique, since PSI-BLAST leverages unlabeled data from a large sequence database to build more informative profiles. Recently presented "cluster kernels" give general semi-supervised methods for improving SVM protein classification performance. We show that our profile kernel results are comparable to cluster kernels while providing much better scalability to large datasets.
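To make the idea of position-dependent mutation neighborhoods concrete, the following is a minimal Python sketch. The abstract does not give the exact definition, so several things here are assumptions: a k-mer is taken to belong to the neighborhood of a profile window when its summed negative log emission probability is below a threshold (called `sigma` here), and the kernel is taken to be the inner product of the resulting k-mer count vectors. The function names and the toy two-letter alphabet are hypothetical, and the naive enumeration stands in for the efficient data structure the abstract mentions without reproducing it.

```python
import math
from collections import Counter


def neighborhood(window, sigma):
    """List every k-mer whose summed -log p over the window is below sigma.

    `window` is a list of k dicts mapping residue -> emission probability.
    Each term -log p is non-negative, so the running cost never decreases
    and a prefix that already reaches sigma can be pruned.
    """
    found = []

    def extend(prefix, cost, pos):
        if pos == len(window):
            found.append(prefix)
            return
        for residue, p in window[pos].items():
            if p <= 0.0:
                continue  # a zero-probability residue can never be emitted
            new_cost = cost - math.log(p)
            if new_cost < sigma:
                extend(prefix + residue, new_cost, pos + 1)

    extend("", 0.0, 0)
    return found


def feature_map(profile, k, sigma):
    """Count, over all length-k windows, the k-mers in each neighborhood."""
    counts = Counter()
    for j in range(len(profile) - k + 1):
        counts.update(neighborhood(profile[j:j + k], sigma))
    return counts


def profile_kernel(profile_x, profile_y, k, sigma):
    """Inner product of the two k-mer count vectors (assumed kernel form)."""
    fx = feature_map(profile_x, k, sigma)
    fy = feature_map(profile_y, k, sigma)
    return sum(count * fy[kmer] for kmer, count in fx.items())


# Toy profiles over a two-letter alphabet (illustrative values only).
mostly_a = {"A": 0.9, "C": 0.1}
mostly_c = {"A": 0.1, "C": 0.9}
profile_x = [mostly_a, mostly_a, mostly_a]
profile_y = [mostly_c, mostly_c, mostly_c]

tight = profile_kernel(profile_x, profile_y, k=2, sigma=1.0)
loose = profile_kernel(profile_x, profile_y, k=2, sigma=3.0)
```

With the tight threshold each window admits only its most probable k-mer, so the two toy profiles share no features. Loosening the threshold admits single-substitution k-mers such as "AC" and "CA", which both profiles then share, and this is how inexact matching lets related sequences score above zero.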