Scaling high-order character language models to gigabytes

  • Authors: Bob Carpenter
  • Affiliations: Alias-i, Inc., Brooklyn, NY
  • Venue: Software '05: Proceedings of the Workshop on Software
  • Year: 2005

Abstract

We describe the implementation steps required to scale high-order character language models to gigabytes of training data without pruning. Our online models build character-level PAT trie structures on the fly using heavily data-unfolded implementations of mutable daughter maps with a long integer count interface. Terminal nodes are shared. Character 8-gram training runs at 200,000 characters per second and allows online tuning of hyperparameters. Our compiled models precompute all probability estimates for observed n-grams and all interpolation parameters, along with suffix pointers that speed up context computations from time proportional to n-gram length to constant time. The result is compiled models that are larger than the training models but execute at 2 million characters per second on a desktop PC. Cross-entropy on held-out data shows these models to be state of the art in terms of performance.
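As a rough illustration of the training-time data structure, the sketch below shows a character trie with long counts whose daughter maps are "data-unfolded" into parallel sorted primitive arrays rather than per-node map objects, together with a long-valued count lookup. The class and method names (TrieNode, CharNGramCounter, getOrCreate, train, count) are hypothetical and not drawn from the paper or from the Alias-i toolkit; the actual implementation unfolds daughter maps far more aggressively (for example, specialized node classes by arity) and shares terminal nodes, which this sketch omits.

import java.util.Arrays;

// Hypothetical sketch: a character trie node with a long count and
// "data-unfolded" daughters kept in parallel sorted primitive arrays
// instead of a general-purpose map object.
final class TrieNode {
    long count;                          // occurrences of the path to this node
    char[] daughterChars = new char[0];  // sorted daughter characters
    TrieNode[] daughters = new TrieNode[0];

    // Return the daughter for c, creating and splicing it in if absent.
    TrieNode getOrCreate(char c) {
        int i = Arrays.binarySearch(daughterChars, c);
        if (i >= 0) return daughters[i];
        int at = -(i + 1);
        char[] chars = new char[daughterChars.length + 1];
        TrieNode[] nodes = new TrieNode[daughters.length + 1];
        System.arraycopy(daughterChars, 0, chars, 0, at);
        System.arraycopy(daughters, 0, nodes, 0, at);
        chars[at] = c;
        nodes[at] = new TrieNode();
        System.arraycopy(daughterChars, at, chars, at + 1, daughterChars.length - at);
        System.arraycopy(daughters, at, nodes, at + 1, daughters.length - at);
        daughterChars = chars;
        daughters = nodes;
        return nodes[at];
    }
}

// Hypothetical counter: counts every substring of length <= maxOrder
// once per occurrence by walking the trie from each start position.
final class CharNGramCounter {
    private final TrieNode root = new TrieNode();
    private final int maxOrder;

    CharNGramCounter(int maxOrder) { this.maxOrder = maxOrder; }

    void train(CharSequence cs) {
        for (int start = 0; start < cs.length(); ++start) {
            root.count++;                // empty context observed once per position
            TrieNode node = root;
            int end = Math.min(cs.length(), start + maxOrder);
            for (int i = start; i < end; ++i) {
                node = node.getOrCreate(cs.charAt(i));
                node.count++;
            }
        }
    }

    // Long-valued count lookup, echoing the long integer count interface.
    long count(CharSequence ngram) {
        TrieNode node = root;
        for (int i = 0; i < ngram.length(); ++i) {
            int j = Arrays.binarySearch(node.daughterChars, ngram.charAt(i));
            if (j < 0) return 0L;
            node = node.daughters[j];
        }
        return node.count;
    }

    public static void main(String[] args) {
        CharNGramCounter counter = new CharNGramCounter(8);
        counter.train("abracadabra");
        System.out.println(counter.count("abra"));  // prints 2
    }
}

On top of such counts, the compiled models described in the abstract would replace the mutable nodes with precomputed log probability estimates, interpolation parameters, and suffix pointers, trading additional memory for the constant-time context computations reported above.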