2005 Special Issue: The context-tree kernel for strings

Authors:
Marco Cuturi;Jean-Philippe Vert
Affiliations:
Computational Biology Group, Ecole des Mines de Paris, 35 rue Saint Honoré, 77300 Fontainebleau, France and The Institute of Statistical Mathematics, 4-6-7 Minami-azabu, Minato-ku, Tokyo 106- ...;Computational Biology Group, Ecole des Mines de Paris, 35 rue Saint Honoré, 77300 Fontainebleau, France
Venue:
Neural Networks - Special issue on neural networks and kernel methods for structured domains
Year:
2005

References (5)

Modeling protein families using probabilistic suffix trees
RECOMB '99 Proceedings of the third annual international conference on Computational molecular biology

Combining pairwise sequence similarity and support vector machines for remote protein homology detection
Proceedings of the sixth annual international conference on Computational biology

A Kernel Approach for Learning from almost Orthogonal Patterns
ECML '02 Proceedings of the 13th European Conference on Machine Learning

The similarity metric
IEEE Transactions on Information Theory

The context-tree weighting method: basic properties
IEEE Transactions on Information Theory

Cited by (7)

Information distance from a question to an answer
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining

New information distance measure and its application in question answering system
Journal of Computer Science and Technology

Information distance and its extensions
DS'11 Proceedings of the 14th international conference on Discovery science

Information distance and its applications
CIAA'06 Proceedings of the 11th international conference on Implementation and Application of Automata

Classification of biological sequences with kernel methods
ICGI'06 Proceedings of the 8th international conference on Grammatical Inference: algorithms and applications

Classifying stem cell differentiation images by information distance
ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I

Information distance between what I said and what it heard
Communications of the ACM

Abstract

We propose a new kernel for strings which borrows ideas and techniques from information theory and data compression. This kernel can be used in combination with any kernel method, in particular Support Vector Machines for string classification, with notable applications in proteomics. By using a Bayesian averaging framework with conjugate priors on a class of Markovian models known as probabilistic suffix trees or context-trees, we compute the value of this kernel in linear time and space while only using the information contained in the spectrum of the considered strings. This is ensured through an adaptation of a compression method known as the context-tree weighting algorithm. Encouraging classification results are reported on a standard protein homology detection experiment, showing that the context-tree kernel performs well with respect to other state-of-the-art methods while using no biological prior knowledge.
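
To make the phrase "the information contained in the spectrum" more concrete, here is a minimal Python sketch of the kind of statistic a context-tree model works from: for every context of length 0 to D, how often each letter follows it in the string. This is only an illustration under assumptions, not the authors' algorithm. The function name `context_counts`, the `depth` parameter, and the toy string are hypothetical, and the Bayesian averaging over trees and priors described in the abstract is not implemented here.

```python
from collections import Counter, defaultdict


def context_counts(s, depth):
    """For each context of length 0..depth, count the letters that follow it in s.

    Hypothetical helper for illustration only: it gathers substring counts
    and does not perform any averaging over context-tree models.
    """
    counts = defaultdict(Counter)
    for i, letter in enumerate(s):
        for d in range(depth + 1):
            if i - d < 0:
                break  # not enough preceding letters for a context this long
            context = s[i - d:i]  # the d letters immediately before position i
            counts[context][letter] += 1
    return counts


# Toy example on a two-letter alphabet.
counts = context_counts("abab", depth=2)
for context in sorted(counts):
    print(repr(context), dict(counts[context]))
```

Each string is read once and, at every position, at most D + 1 contexts are updated, so for a fixed depth the cost of gathering these counts grows linearly with the length of the string.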