The sequence memoizer

  • Authors:
  • Frank Wood, Columbia University, New York; Jan Gasthaus, University College London, England; Cédric Archambeau, Xerox Research Centre Europe, Grenoble, France; Lancelot James, Hong Kong University of Science and Technology, Kowloon, Hong Kong; Yee Whye Teh, University College London, England

  • Venue:
  • Communications of the ACM
  • Year:
  • 2011

Abstract

Probabilistic models of sequences play a central role in applications such as machine translation, automated speech recognition, lossless compression, spell-checking, and gene identification. Unfortunately, real-world sequence data often exhibit long-range dependencies that can only be captured by computationally challenging, complex models. Sequence data arising from natural processes also often exhibit power-law properties, yet common sequence models do not capture such properties. The sequence memoizer is a new hierarchical Bayesian model for discrete sequence data that captures long-range dependencies and power-law characteristics while remaining computationally attractive. Its utility as a language model and general-purpose lossless compressor is demonstrated.