String kernels that compare the set of all common substrings between two given strings were recently proposed by Vishwanathan & Smola (2004). Surprisingly, these kernels can be computed in linear time and linear space using annotated suffix trees. Although the suffix tree based algorithm requires O(n) space in theory for a string of length n, in practice it needs at least 40n bytes: 20n bytes for storing the suffix tree and an additional 20n bytes for the annotation. This large memory requirement, coupled with the poor locality of memory access inherent in suffix trees, means that the performance of the suffix tree based algorithm deteriorates on large strings. In this paper, we describe a new linear time, yet space efficient and scalable, algorithm for computing string kernels based on suffix arrays. Our algorithm a) is faster and easier to implement, b) requires only 19n bytes of storage on average, and c) exhibits strong locality of memory access. We show that our algorithm can be extended to perform linear time prediction on a test string, and we present experiments to validate our claims.
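To make the idea of an all-common-substrings kernel on suffix arrays concrete, here is a minimal sketch in Python. It is not the algorithm of the paper: it assumes unit weights for every substring, builds the suffix array by naive sorting, and sums over suffix pairs in quadratic time, so it illustrates only the data structures (suffix array plus longest-common-prefix array), not the linear time bound. The function names and the separator characters `"\x00"` and `"\x01"` are hypothetical choices made for illustration, and the inputs are assumed not to contain those characters.

```python
def suffix_array(s):
    # Naive construction by sorting all suffixes; simple, but not linear time.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s, sa):
    # Kasai-style computation: lcp[r] is the length of the longest common
    # prefix of the suffixes at ranks r-1 and r in the suffix array (lcp[0] = 0).
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def common_substring_kernel(x, y):
    # Unit-weight kernel: sum over all non-empty substrings s of
    # (occurrences of s in x) * (occurrences of s in y).
    # This equals the sum, over all suffix pairs (one from x, one from y),
    # of the length of their longest common prefix.
    # Assumption: x and y contain neither "\x00" nor "\x01".
    s = x + "\x00" + y + "\x01"
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    nx, last = len(x), len(s) - 1

    def owner(i):
        # Which input string the suffix starting at position i belongs to.
        if i < nx:
            return "x"
        if nx < i < last:
            return "y"
        return None  # suffix starts at a separator

    total = 0
    for a in range(len(sa)):
        running_min = None
        for b in range(a + 1, len(sa)):
            # The LCP of suffixes at ranks a and b is the minimum of the
            # adjacent LCP values between them.
            running_min = lcp[b] if running_min is None else min(running_min, lcp[b])
            if running_min == 0:
                break  # no later suffix shares a prefix with rank a
            oa, ob = owner(sa[a]), owner(sa[b])
            if oa is not None and ob is not None and oa != ob:
                total += running_min
    return total
```

For example, `common_substring_kernel("abab", "bab")` counts the shared substrings `a` (2 × 1), `b` (2 × 2), `ab` (2 × 1), `ba` (1 × 1), and `bab` (1 × 1). A linear time method would replace both the naive sort and the quadratic pair loop.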