Efficient evaluation of large sequence kernels

Authors:
Pavel P. Kuksa;Vladimir Pavlovic
Affiliations:
NEC Laboratories America, Princeton, NJ, USA;Rutgers University, New Brunswick, NJ, USA
Venue:
Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2012

Abstract

String kernels with inexact matching (e.g., spectrum or mismatch kernels) have been highly successful for classifying sequences drawn from a finite alphabet. However, selecting the best mismatch kernel for a particular task is severely limited by the inability to compute such kernels for long substrings (k-mers) with potentially many mismatches (m). In this work we introduce a new method that exactly evaluates kernels for large k, large m, and arbitrary alphabet size. The method first solves the more tractable problem for small alphabets and then generalizes trivially to any alphabet by solving a small linear system of equations. This makes it possible to explore a larger set of kernels over a wide range of kernel parameters, opening the possibility of better model selection and improved performance of string kernels. To investigate the utility of large (k,m) string kernels, we consider several sequence classification problems, including protein remote homology detection, fold prediction, and music classification. Our results show that longer k-mers with more allowed substitutions can improve classification performance.
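
To make the object under discussion concrete, the sketch below computes a (k,m)-mismatch kernel by brute force from its usual definition: each feature is a candidate k-mer, and its value counts the k-mers of the sequence that lie within Hamming distance m of it. This is not the method introduced in the paper; it is a naive baseline under stated assumptions, and the function names (hamming, mismatch_features, mismatch_kernel) are illustrative only. Because it enumerates every k-mer over the alphabet, its cost grows as the alphabet size raised to the power k, which hints at why direct evaluation becomes impractical for large k and m.

```python
from collections import Counter
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(ca != cb for ca, cb in zip(a, b))


def mismatch_features(seq, k, m, alphabet):
    """Explicit (k, m)-mismatch feature map, built by brute force.

    The feature indexed by a candidate k-mer `gamma` counts the k-mers
    of `seq` that are within Hamming distance m of `gamma`.
    """
    kmers = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    features = {}
    # Enumerate all len(alphabet)**k candidates: only viable for tiny k.
    for gamma in map("".join, product(alphabet, repeat=k)):
        count = sum(n for kmer, n in kmers.items() if hamming(kmer, gamma) <= m)
        if count:
            features[gamma] = count
    return features


def mismatch_kernel(x, y, k, m, alphabet):
    """Inner product of the two explicit mismatch feature maps."""
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    return sum(value * fy.get(gamma, 0) for gamma, value in fx.items())


if __name__ == "__main__":
    # Toy DNA example; sequences and parameters are made up for illustration.
    print(mismatch_kernel("ACGT", "ACGA", k=2, m=1, alphabet="ACGT"))
```

With m = 0 the same function reduces to the spectrum kernel, which simply counts exactly matching k-mer pairs between the two sequences.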