Word sequence kernels

  • Authors:
  • Nicola Cancedda; Eric Gaussier; Cyril Goutte; Jean Michel Renders

  • Affiliations:
  • Xerox Research Centre Europe, 6, chemin de Maupertuis, 38240 Meylan, France (all authors)

  • Venue:
  • The Journal of Machine Learning Research
  • Year:
  • 2003

Abstract

We address the problem of categorising documents using kernel-based methods such as Support Vector Machines. Since the work of Joachims (1998), there has been ample experimental evidence that SVMs using the standard word frequencies as features yield state-of-the-art performance on a number of benchmark problems. Recently, Lodhi et al. (2002) proposed the use of string kernels, a novel way of computing document similarity based on matching non-consecutive subsequences of characters. In this article, we propose the use of this technique with sequences of words rather than characters. This approach has several advantages: in particular, it is more efficient computationally and it ties in closely with standard linguistic pre-processing techniques. We present some extensions to sequence kernels dealing with symbol-dependent and match-dependent decay factors, and present empirical evaluations of these extensions on the Reuters-21578 dataset.
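
To make the idea of matching non-consecutive subsequences of words concrete, the following is a minimal Python sketch of a gap-weighted subsequence kernel applied to word tokens. It follows the dynamic-programming recursion commonly given for the string kernel of Lodhi et al. (2002), with a single decay factor; it is not the authors' implementation and does not cover the symbol-dependent or match-dependent decay extensions mentioned above. The function names `word_sequence_kernel` and `normalised_kernel`, the parameter names `n` and `lam`, and the whitespace tokenisation are all assumptions made for illustration.

```python
from math import sqrt


def word_sequence_kernel(s, t, n, lam):
    """Gap-weighted subsequence kernel of order n over token lists s and t.

    Every word subsequence of length n shared by s and t contributes
    lam ** (span in s + span in t), where a span is the distance from the
    first to the last matched word, inclusive. Hypothetical helper.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    ls, lt = len(s), len(t)
    # prev[a][b] holds the auxiliary value K'_{i-1}(s[:a], t[:b]).
    # K'_0 is 1 for every pair of prefixes.
    prev = [[1.0] * (lt + 1) for _ in range(ls + 1)]
    for _ in range(1, n):
        cur = [[0.0] * (lt + 1) for _ in range(ls + 1)]
        for a in range(1, ls + 1):
            kpp = 0.0  # running sum over matches of s[a-1] inside t[:b]
            for b in range(1, lt + 1):
                kpp *= lam  # every earlier match is one word further away
                if s[a - 1] == t[b - 1]:
                    kpp += lam * lam * prev[a - 1][b - 1]
                cur[a][b] = lam * cur[a - 1][b] + kpp
        prev = cur
    # Close each subsequence with a final matching word pair.
    total = 0.0
    for a in range(1, ls + 1):
        for b in range(1, lt + 1):
            if s[a - 1] == t[b - 1]:
                total += lam * lam * prev[a - 1][b - 1]
    return total


def normalised_kernel(s, t, n, lam):
    """Kernel value rescaled so that a document compared with itself gives 1."""
    denom = sqrt(word_sequence_kernel(s, s, n, lam) * word_sequence_kernel(t, t, n, lam))
    return word_sequence_kernel(s, t, n, lam) / denom if denom > 0 else 0.0


doc_a = "the cat sat".split()
doc_b = "the big cat".split()
k_gap = word_sequence_kernel(doc_a, doc_b, n=2, lam=0.5)
k_self = normalised_kernel(doc_a, doc_a, n=2, lam=0.5)
```

In this example the only shared word pair is ("the", "cat"), which spans two words in the first document and three in the second, so `k_gap` equals `0.5 ** 5`. Working on word tokens rather than characters shortens the sequences considerably, which is one way to read the efficiency advantage claimed in the abstract.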