We present Tightly Packed Tries (TPTs), a compact implementation of read-only, compressed trie structures with fast on-demand paging and short load times. We demonstrate the benefits of TPTs for storing n-gram back-off language models and phrase tables for statistical machine translation. Encoded as TPTs, these databases require less space than flat text file representations of the same data compressed with the gzip utility. At the same time, they can be mapped into memory quickly and be searched directly in time linear in the length of the key, without the need to decompress the entire file. The overhead for local decompression during search is marginal.
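The core idea above — a trie serialized into one flat, read-only buffer that can be memory-mapped and searched in place, in time linear in the key length — can be illustrated with a minimal sketch. The node layout below (a value flag, an optional 4-byte value, then sorted child entries with absolute offsets) is an assumption for illustration only; it is not the paper's actual encoding, which additionally applies local compression to node records.

```python
import struct

def pack_trie(entries):
    """Serialize {bytes: int} entries into one flat byte buffer.

    Illustrative node layout (NOT the paper's actual encoding):
      [has_value:1][value:4 if present][n_children:1]
      followed by n_children sorted (key_byte:1, child_offset:4) pairs.
    The root node's offset is appended as the last 4 bytes.
    """
    # Build an in-memory nested-dict trie first; None marks a stored value.
    root = {}
    for key, val in entries.items():
        node = root
        for b in key:
            node = node.setdefault(b, {})
        node[None] = val

    buf = bytearray()

    def emit(node):
        # Emit children first (bottom-up) so their offsets are known.
        children = sorted((k, v) for k, v in node.items() if k is not None)
        offsets = [emit(child) for _, child in children]
        pos = len(buf)
        if None in node:
            buf.append(1)
            buf.extend(struct.pack('<I', node[None]))
        else:
            buf.append(0)
        buf.append(len(children))
        for (byte, _), off in zip(children, offsets):
            buf.append(byte)
            buf.extend(struct.pack('<I', off))
        return pos

    root_off = emit(root)
    buf.extend(struct.pack('<I', root_off))
    return bytes(buf)

def lookup(buf, key):
    """Search the packed buffer directly, one node per key byte.

    No global decompression: only the nodes on the search path are read,
    which is what makes memory-mapped, on-demand paging effective.
    """
    pos = struct.unpack_from('<I', buf, len(buf) - 4)[0]
    for b in key:
        p = pos + 1 + (4 if buf[pos] else 0)  # skip flag and any value
        n_children = buf[p]
        p += 1
        child = None
        for i in range(n_children):  # linear scan; could binary-search
            if buf[p + i * 5] == b:
                child = struct.unpack_from('<I', buf, p + i * 5 + 1)[0]
                break
        if child is None:
            return None
        pos = child
    if buf[pos]:
        return struct.unpack_from('<I', buf, pos + 1)[0]
    return None
```

In a real deployment the buffer would come from `mmap`, so a lookup touches only the pages holding the nodes on its path; the sketch omits that and the variable-length integer coding that gives TPTs their compactness.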