Fast and space efficient string kernels using suffix arrays

Authors:
Choon Hui Teo; S. V. N. Vishwanathan
Affiliations:
National ICT Australia, Canberra ACT, Australia and Australian National University, Canberra ACT, Australia (both authors)
Venue:
ICML '06 Proceedings of the 23rd international conference on Machine learning
Year:
2006

Abstract

String kernels that compare the set of all common substrings between two given strings were recently proposed by Vishwanathan & Smola (2004). Surprisingly, these kernels can be computed in linear time and linear space using annotated suffix trees. Although the suffix-tree-based algorithm requires O(n) space in theory for a string of length n, in practice it requires at least 40n bytes: 20n bytes for storing the suffix tree and an additional 20n bytes for the annotation. This large memory requirement, coupled with the poor locality of memory access inherent in suffix trees, means that the performance of the suffix-tree-based algorithm deteriorates on large strings. In this paper, we describe a new linear-time, yet space-efficient and scalable, algorithm for computing string kernels, based on suffix arrays. Our algorithm a) is faster and easier to implement, b) requires only 19n bytes of storage on average, and c) exhibits strong locality of memory access. We show that our algorithm can be extended to perform linear-time prediction on a test string, and we present experiments to validate our claims.
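
To make the idea of a substring kernel over a suffix array concrete, here is a minimal Python sketch. It is not the authors' algorithm: it assumes unit weights for every substring, builds the suffix array by naive sorting, and sums longest-common-prefix values over suffix pairs in quadratic time, whereas the paper describes a linear-time method. The function names (`suffix_array`, `lcp_array`, `substring_kernel`) and the separator character are hypothetical choices made for illustration.

```python
def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    # Fine for a demo; real implementations use linear-time suffix sorting.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # Kasai-style LCP computation: lcp[r] is the length of the longest
    # common prefix of the suffixes at sa[r - 1] and sa[r]; lcp[0] is 0.
    n = len(text)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def substring_kernel(x, y, sep="\x00"):
    # Assumed kernel: sum over all substrings s of count_x(s) * count_y(s),
    # i.e. every common substring has weight 1. This equals the sum of LCP
    # lengths over all pairs (suffix of x, suffix of y).
    assert sep not in x and sep not in y, "separator must not occur in inputs"
    text = x + sep + y  # the separator stops matches from spanning x and y
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    boundary = len(x)  # positions < boundary belong to x
    total = 0
    for a in range(len(sa)):
        running = None  # LCP between suffix sa[a] and suffix sa[b]
        for b in range(a + 1, len(sa)):
            running = lcp[b] if running is None else min(running, lcp[b])
            if running == 0:
                break  # no later suffix shares a prefix with sa[a]
            if (sa[a] < boundary) != (sa[b] < boundary):
                total += running  # one suffix from x, one from y
    return total


print(substring_kernel("ab", "ab"))  # 3: shared substrings "a", "b", "ab"
```

The only index structures here are two flat integer arrays over the concatenated text, which hints at why an array-based layout can be more compact and more cache-friendly than a pointer-based suffix tree.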