Faster and smaller N-gram language models

Authors:
Adam Pauls; Dan Klein
Affiliations:
University of California, Berkeley; University of California, Berkeley
Venue:
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Year:
2011

Abstract

N-gram language models are a major resource bottleneck in machine translation. In this paper, we present several language model implementations that are both highly compact and fast to query. Our fastest implementation is as fast as the widely used SRILM while requiring only 25% of the storage. Our most compact representation can store all 4 billion n-grams and associated counts for the Google n-gram corpus in 23 bits per n-gram, the most compact lossless representation to date, and even more compact than recent lossy compression techniques. We also discuss techniques for improving query speed during decoding, including a simple but novel language model caching technique that improves the query speed of our language models (and SRILM) by up to 300%.
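
The abstract gives the storage figure and mentions a caching technique without describing its design. As a rough illustration only, the sketch below first works out what 23 bits per n-gram implies for 4 billion n-grams (about 11.5 GB), then shows one simple way a query cache could sit in front of a language model: a fixed-size direct-mapped cache keyed on the n-gram. The names `storage_gigabytes`, `DirectMappedCache`, and `score_fn` are hypothetical, and the direct-mapped layout is an assumption made for this example, not a design confirmed by the abstract.

```python
from typing import Callable, List, Optional, Tuple

NGram = Tuple[str, ...]


def storage_gigabytes(num_ngrams: int, bits_per_ngram: float) -> float:
    """Total size in gigabytes (10**9 bytes) at a given per-n-gram cost."""
    return num_ngrams * bits_per_ngram / 8 / 1e9


class DirectMappedCache:
    """Hypothetical fixed-size cache in front of a slow n-gram scoring function.

    Each n-gram hashes to exactly one slot. A colliding n-gram simply
    overwrites the slot, so lookups and updates need no eviction bookkeeping.
    """

    def __init__(self, score_fn: Callable[[NGram], float], num_slots: int = 1 << 16):
        self._score_fn = score_fn  # the underlying (slow) language model query
        self._keys: List[Optional[NGram]] = [None] * num_slots
        self._values: List[float] = [0.0] * num_slots
        self.hits = 0
        self.misses = 0

    def score(self, ngram: NGram) -> float:
        slot = hash(ngram) % len(self._keys)
        if self._keys[slot] == ngram:
            # Same n-gram seen recently: skip the model entirely.
            self.hits += 1
            return self._values[slot]
        # Miss (empty slot or a different n-gram): query the model, overwrite.
        self.misses += 1
        value = self._score_fn(ngram)
        self._keys[slot] = ngram
        self._values[slot] = value
        return value


if __name__ == "__main__":
    # Toy stand-in for a real language model: a dict of log-probabilities.
    toy_model = {("the", "cat"): -1.2, ("cat", "sat"): -2.5}
    cache = DirectMappedCache(lambda ng: toy_model.get(ng, -10.0), num_slots=1024)

    for ngram in [("the", "cat"), ("the", "cat"), ("cat", "sat")]:
        print(ngram, cache.score(ngram))

    print("size at 23 bits/n-gram:", storage_gigabytes(4_000_000_000, 23), "GB")
```

A cache of this kind only helps when queries repeat, as they tend to when a decoder rescores overlapping hypotheses. The slot count trades memory against collision rate and would need tuning for a real workload.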