Kernel-Based Learning of Hierarchical Multilabel Classification Models
The Journal of Machine Learning Research
Linear-Time Computation of Similarity Measures for Sequential Data
The Journal of Machine Learning Research
A dependency-based word subsequence kernel
EMNLP '08 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Graph-based learning for statistical machine translation
NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Semi-supervised abstraction-augmented string kernel for multi-level bio-relation extraction
ECML PKDD'10 Proceedings of the 2010 European conference on Machine learning and knowledge discovery in databases: Part II
Conceptual modeling of online entertainment programming guide for natural language interface
NLDB'10 Proceedings of Natural Language Processing and Information Systems: 15th International Conference on Applications of Natural Language to Information Systems
Efficient algorithms for similarity measures over sequential data: a look beyond kernels
DAGM'06 Proceedings of the 28th conference on Pattern Recognition
A fast bit-parallel algorithm for gapped string kernels
ICONIP'06 Proceedings of the 13th International Conference on Neural Information Processing - Volume Part I
We present a sparse dynamic programming algorithm that, given two strings s and t, a gap penalty λ, and an integer p, computes the value of the gap-weighted length-p subsequences kernel. The algorithm runs in time O(p |M| log |t|), where M = {(i,j) | s_i = t_j} is the set of character matches between the two strings. The algorithm is easily adapted to handle bounded-length subsequences and different gap-penalty schemes, including penalizing by the total length of gaps or by the number of gaps, as well as incorporating character-specific match and gap penalties.

The new algorithm is evaluated empirically against a full dynamic programming approach and a trie-based algorithm, on both synthetic and newswire article data. The experiments show that the full dynamic programming approach is the fastest on short strings, and on long strings when the alphabet is small. On large alphabets, the new sparse dynamic programming algorithm is the most efficient. On medium-sized alphabets, the trie-based approach is best if the maximum number of allowed gaps is strongly restricted.
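As background for the baseline the abstract compares against, here is a minimal sketch of the standard full dynamic programming recursion for the gap-weighted length-p subsequences kernel (the Lodhi-style O(p·|s|·|t|) algorithm, not the paper's sparse O(p |M| log |t|) method), with a single decay factor λ charged per spanned position. The function name is illustrative, not taken from the paper.

```python
def gap_weighted_subseq_kernel(s, t, p, lam):
    """Full-DP gap-weighted subsequences kernel.

    Sums, over all common subsequences u of length p and all occurrence
    pairs of u in s and t, the weight lam ** (span in s + span in t).
    """
    n, m = len(s), len(t)
    # DPS[i][j]: contribution of length-1 common subsequences ending exactly
    # at s[i-1] and t[j-1]; one matched character spans one position in each
    # string, hence the weight lam**2.
    DPS = [[lam * lam if i and j and s[i - 1] == t[j - 1] else 0.0
            for j in range(m + 1)] for i in range(n + 1)]
    for _ in range(2, p + 1):
        # DP[i][j] accumulates contributions of shorter common subsequences
        # ending anywhere in the prefixes s[:i], t[:j], decayed by lam for
        # each extra spanned position (inclusion-exclusion on the overlap).
        DP = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                DP[i][j] = (DPS[i][j]
                            + lam * DP[i - 1][j]
                            + lam * DP[i][j - 1]
                            - lam * lam * DP[i - 1][j - 1])
        # Extend every partial subsequence by one more matched character.
        new_DPS = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if s[i - 1] == t[j - 1]:
                    new_DPS[i][j] = lam * lam * DP[i - 1][j - 1]
        DPS = new_DPS
    return sum(map(sum, DPS))
```

For example, for s = "cat", t = "cart", p = 2 the common subsequences are "ca", "at", and "ct", contributing λ⁴ + λ⁵ + λ⁷. Note that the inner work is proportional to |s|·|t| regardless of how few character matches exist; the sparse algorithm described in the abstract instead touches only the match set M, which is what makes it win on large alphabets where |M| ≪ |s|·|t|.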