An LZ78 based string kernel

  • Authors: Ming Li; Ronan Sleep

  • Affiliations: University of East Anglia, Norwich, UK (both authors)

  • Venue: ADMA'05: Proceedings of the First International Conference on Advanced Data Mining and Applications
  • Year: 2005

Abstract

We have previously shown [8] that LZ78 parse length can be used effectively for a music classification task: the parse length is used to compute a normalized information distance [6,7], which in turn drives a simple classifier. In this paper we explore a more subtle use of the LZ78 parsing algorithm. Instead of simply counting the parse length of a string, we use the coding dictionary constructed by LZ78 to derive a valid string kernel for a Support Vector Machine (SVM). The kernel is defined over a feature space indexed by all the phrases identified by our (modified) LZ78 compression algorithm. We report experiments with our kernel approach on two datasets: (i) a collection of MIDI files and (ii) Reuters-21578. We compare our technique with an n-gram based kernel. Our results indicate that the LZ78 kernel performs comparably to the best n-gram kernel, but with significantly lower computational overhead and without requiring a search for the optimal value of n.
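
To make the idea of a phrase-indexed feature space concrete, the following is a minimal sketch rather than the authors' method. It assumes the textbook LZ78 parse (the abstract does not describe the authors' modification), binary phrase-presence features, and cosine normalization; the function names lz78_parse and lz78_kernel are hypothetical. Under these assumptions the kernel is the normalized size of the intersection of two phrase dictionaries, which is an inner product of indicator vectors and hence a valid kernel.

```python
from math import sqrt


def lz78_parse(s):
    """Standard LZ78 parse: return the list of phrases in order.

    Each phrase is the shortest prefix of the remaining input that is
    not yet in the dictionary. A trailing phrase that is already in
    the dictionary is still emitted, so len(result) is the parse length.
    """
    dictionary = set()
    phrases = []
    current = ""
    for symbol in s:
        current += symbol
        if current not in dictionary:
            dictionary.add(current)
            phrases.append(current)
            current = ""
    if current:
        phrases.append(current)  # leftover phrase, already in the dictionary
    return phrases


def lz78_kernel(s, t):
    """Assumed kernel: cosine-normalized overlap of LZ78 phrase sets.

    Equivalent to the inner product of binary phrase-indicator vectors,
    divided by the product of their norms.
    """
    ps, pt = set(lz78_parse(s)), set(lz78_parse(t))
    if not ps or not pt:
        return 0.0
    return len(ps & pt) / sqrt(len(ps) * len(pt))
```

For example, lz78_parse("abababa") yields the phrases a, b, ab, aba, and a string compared with itself has kernel value 1. A Gram matrix built from lz78_kernel could be passed to any SVM implementation that accepts precomputed kernels; no choice of n is involved, which is the practical advantage the abstract claims over n-gram kernels.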