We propose a new kernel for strings that borrows ideas and techniques from information theory and data compression. This kernel can be used in combination with any kernel method, in particular Support Vector Machines for string classification, with notable applications in proteomics. By using a Bayesian averaging framework with conjugate priors on a class of Markovian models known as probabilistic suffix trees or context-trees, we compute the value of this kernel in linear time and space while using only the information contained in the spectrum of the considered strings. This is ensured through an adaptation of a compression method known as the context-tree weighting algorithm. Encouraging classification results are reported on a standard protein homology detection experiment, showing that the context-tree kernel performs well compared with other state-of-the-art methods while using no biological prior knowledge.
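To make the ingredients named above more concrete, here is a minimal Python sketch of a context-tree-weighting-style mixture computed on the pooled context counts of two strings. It is an illustration under stated assumptions, not the paper's exact kernel. The function names (`context_counts`, `log_mixture`, `log_ct_score`), the symmetric Dirichlet parameter `alpha = 0.5`, the uniform 1/2 split weight at every tree node, and the choice to pool the counts of both strings are all assumptions made for this example. The plain recursion below also does not reproduce the linear-time computation the abstract claims.

```python
from collections import Counter, defaultdict
from math import lgamma, log, exp, log1p

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed default alphabet

def context_counts(s, depth):
    """Count next-symbol occurrences after every context of length 0..depth.

    Counting starts at position `depth`, so each counted symbol has a full
    context and a node's counts equal the sum of its children's counts.
    """
    counts = defaultdict(Counter)
    for i in range(depth, len(s)):
        for d in range(depth + 1):
            counts[s[i - d:i]][s[i]] += 1
    return counts

def log_mixture(counts, alphabet, depth, alpha=0.5):
    """Log of a CTW-style Bayesian mixture over context trees up to `depth`."""
    size = len(alphabet)

    def log_pe(cnt):
        # Dirichlet-multinomial marginal likelihood of the counts at one node
        # (symmetric Dirichlet prior, conjugate to the multinomial).
        total = sum(cnt[a] for a in alphabet)
        out = lgamma(size * alpha) - lgamma(size * alpha + total)
        for a in alphabet:
            out += lgamma(alpha + cnt[a]) - lgamma(alpha)
        return out

    def log_pw(ctx):
        if ctx not in counts:      # unseen context contributes probability 1
            return 0.0
        lpe = log_pe(counts[ctx])
        if len(ctx) == depth:      # maximal depth: no further split
            return lpe
        # Children extend the context by one older symbol on the left.
        lsplit = sum(log_pw(a + ctx) for a in alphabet)
        hi, lo = max(lpe, lsplit), min(lpe, lsplit)
        # log(0.5 * exp(lpe) + 0.5 * exp(lsplit)), computed stably.
        return log(0.5) + hi + log1p(exp(lo - hi))

    return log_pw("")

def log_ct_score(x, y, alphabet=AMINO_ACIDS, depth=2, alpha=0.5):
    """Hypothetical similarity score: log mixture of the pooled counts."""
    pooled = context_counts(x, depth)
    for ctx, cnt in context_counts(y, depth).items():
        pooled[ctx].update(cnt)
    return log_mixture(pooled, alphabet, depth, alpha)
```

A call such as `log_ct_score("MKVLAAGIV", "MKVLSAGIV")` returns a log-probability, so the value is never positive. Working in log space avoids underflow on long sequences. Any normalization needed before passing such a score to a Support Vector Machine is left out of this sketch.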