The sequence memoizer

Authors:
Frank Wood;Jan Gasthaus;Cédric Archambeau;Lancelot James;Yee Whye Teh
Affiliations:
Columbia University, New York;University College London, England;Xerox Research Centre Europe, Grenoble, France;Hong Kong University of Science and Technology, Kowloon, Hong Kong;University College London, England
Venue:
Communications of the ACM
Year:
2011

Abstract

Probabilistic models of sequences play a central role in most machine translation, automated speech recognition, lossless compression, spell-checking, and gene identification applications, to name but a few. Unfortunately, real-world sequence data often exhibit long-range dependencies that can be captured only by complex, computationally challenging models. Sequence data arising from natural processes also often exhibit power-law properties, yet common sequence models do not capture such properties. The sequence memoizer is a new hierarchical Bayesian model for discrete sequence data that captures long-range dependencies and power-law characteristics while remaining computationally attractive. Its utility as a language model and as a general-purpose lossless compressor is demonstrated.