Enhanced suffix arrays as language models: virtual k-testable languages

  • Authors: Herman Stehouwer; Menno van Zaanen
  • Affiliations: TiCC, Tilburg University, Tilburg, The Netherlands (both authors)
  • Venue: ICGI'10 Proceedings of the 10th international colloquium conference on Grammatical inference: theoretical results and applications
  • Year: 2010

Abstract

In this article, we propose using suffix arrays to implement n-gram language models efficiently, with a practically unlimited n-gram size n. Combined with synchronous back-off, this approach lets us distinguish between alternative sequences using large contexts. We also show that such models can be built with additional information for each symbol, such as part-of-speech tags and dependency information. The approach can also be viewed as a collection of virtual k-testable automata: once the model is built, we can directly access the results of any k-testable automaton generated from the training data. Synchronous back-off automatically identifies the k-testable automaton with the largest feasible k. We have used this approach in several classification tasks.
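
As an illustration only, the idea can be sketched in Python. This is not the authors' implementation: the function names (`build_suffix_array`, `count_ngram`, `synchronous_backoff`) are hypothetical, the suffix array is built by naive sorting rather than an efficient construction algorithm, only left context is used, and synchronous back-off is read here as "score all alternatives with the same n-gram size, and shorten the context for all of them together until at least one has been seen in training".

```python
def build_suffix_array(tokens):
    # Naive construction (an assumption for brevity): sort all suffix start
    # positions by the token sequence that begins at each position.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, ngram, upper):
    # Binary search over the suffix array, comparing only the first
    # len(ngram) tokens of each suffix with the query n-gram.
    n = len(ngram)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + n]
        if prefix < ngram or (upper and prefix == ngram):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_ngram(tokens, sa, ngram):
    # Every occurrence of the n-gram is the prefix of exactly one suffix,
    # and those suffixes are contiguous in the sorted suffix array.
    return _bound(tokens, sa, ngram, True) - _bound(tokens, sa, ngram, False)


def synchronous_backoff(tokens, sa, left_context, alternatives):
    # Try the longest context first; if no alternative was seen with it,
    # drop the leftmost context token for ALL alternatives at once.
    for start in range(len(left_context) + 1):
        context = left_context[start:]
        counts = {alt: count_ngram(tokens, sa, context + [alt])
                  for alt in alternatives}
        if any(counts.values()):
            best = max(counts, key=counts.get)
            return best, len(context) + 1, counts
    return None, 0, {}


# Hypothetical toy training data, not taken from the paper.
corpus = "the cat sat on the mat and then the cat ran".split()
sa = build_suffix_array(corpus)
choice, n_used, counts = synchronous_backoff(
    corpus, sa, ["saw", "the"], ["cat", "mat"])
```

In this sketch the n-gram size is never fixed in advance: any n-gram can be counted from the same suffix array, so each value of n corresponds to one "virtual" model, and the back-off loop simply returns the largest n for which the training data still distinguishes the alternatives.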