Storing the web in memory: space efficient language models with constant time retrieval

Authors:
David Guthrie; Mark Hepple
Affiliations:
University of Sheffield; University of Sheffield
Venue:
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Year:
2010

Abstract

We present three novel methods of compactly storing very large n-gram language models. These methods use substantially less space than all known approaches and allow n-gram probabilities or counts to be retrieved in constant time, at speeds comparable to modern language modeling toolkits. Our basic approach generates an explicit minimal perfect hash function that maps all n-grams in a model to distinct integers to enable storage of associated values. Extensions of this approach exploit distributional characteristics of n-gram data to reduce storage costs, including variable-length coding of values and the use of tiered structures that partition the data for more efficient storage. We apply our approach to storing the full Google Web1T n-gram set and all 1-to-5 grams of the Gigaword newswire corpus. For the 1.5 billion n-grams of Gigaword, for example, we can store full count information at a cost of 1.66 bytes per n-gram (around 30% of the cost when using the current state-of-the-art approach), or quantized counts for 1.41 bytes per n-gram. For applications that are tolerant of a certain class of relatively innocuous errors (where unseen n-grams may be accepted as rare n-grams), we can reduce the latter cost to below 1 byte per n-gram.
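
The basic idea in the abstract, a minimal perfect hash that maps every stored n-gram to a distinct integer which then indexes an array of values, can be illustrated with a small Python sketch. This is not the authors' implementation: the hash-and-displace construction, the use of BLAKE2b as the seeded hash, the per-slot fingerprint used to reject unseen n-grams, and every name below (`h`, `build_mph`, `mph_index`, `NgramStore`, `FP_SEED`) are assumptions chosen for brevity. It also stores plain Python integers rather than the compact, variable-length encodings the abstract mentions, and it assumes a non-empty set of distinct n-grams.

```python
import hashlib

FP_SEED = 2**63  # hypothetical seed reserved for fingerprints, distinct from MPH seeds


def h(key: str, seed: int, m: int) -> int:
    """Seeded hash of `key` into the range [0, m)."""
    digest = hashlib.blake2b(
        key.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little") % m


def build_mph(keys):
    """Hash-and-displace: find one seed per bucket so all keys get distinct slots."""
    n = len(keys)
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h(k, 0, n)].append(k)
    seeds = [0] * n
    taken = [False] * n
    # Place the largest buckets first, while most slots are still free.
    for b in sorted(range(n), key=lambda i: -len(buckets[i])):
        bucket = buckets[b]
        if not bucket:
            break  # all remaining buckets are empty
        seed = 1
        while True:
            slots = [h(k, seed, n) for k in bucket]
            if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                break
            seed += 1
        seeds[b] = seed
        for s in slots:
            taken[s] = True
    return seeds


def mph_index(key: str, seeds) -> int:
    """Constant-time lookup: two hash evaluations, no stored keys."""
    n = len(seeds)
    return h(key, seeds[h(key, 0, n)], n)


class NgramStore:
    """Values indexed by the perfect hash; the n-gram strings are not kept."""

    def __init__(self, counts: dict, fp_bits: int = 12):
        self.fp_bits = fp_bits
        self.seeds = build_mph(list(counts))
        n = len(counts)
        self.fps = [0] * n
        self.values = [0] * n
        for ngram, value in counts.items():
            i = mph_index(ngram, self.seeds)
            self.fps[i] = h(ngram, FP_SEED, 2**fp_bits)
            self.values[i] = value

    def get(self, ngram: str):
        """Return the stored value, or None if the fingerprint does not match."""
        i = mph_index(ngram, self.seeds)
        if self.fps[i] != h(ngram, FP_SEED, 2**self.fp_bits):
            return None
        return self.values[i]


store = NgramStore({"the": 120, "the cat": 7, "the cat sat": 2, "on the mat": 3})
count = store.get("the cat")  # value stored for a known n-gram
```

Because the keys themselves are discarded, an unseen n-gram always hashes to some occupied slot; in this sketch it is wrongly accepted only when its fingerprint happens to match, which for a well-mixed hash occurs with probability of roughly 2^-fp_bits. Shrinking `fp_bits` saves space at the cost of more such false positives. For scale, the abstract's figure of 1.66 bytes per n-gram over 1.5 billion n-grams works out to roughly 2.5 GB.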