A succinct N-gram language model

  • Authors:
  • Taro Watanabe; Hajime Tsukada; Hideki Isozaki

  • Affiliations:
  • NTT Communication Science Laboratories, Soraku-gun, Kyoto, Japan (all authors)

  • Venue:
  • ACLShort '09: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers
  • Year:
  • 2009


Abstract

Efficient processing of tera-scale text data is an important research topic. This paper proposes lossless compression of N-gram language models based on LOUDS, a succinct data structure. LOUDS succinctly represents a trie with M nodes as a (2M + 1)-bit string. We compress this representation further by exploiting the structure of N-gram language models, and apply variable-length coding and block-wise compression to the values associated with the nodes. Experiments on three large-scale N-gram compression tasks achieved significant compression rates without any loss.
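
The LOUDS encoding itself is simple: visiting the trie's nodes in level order, each node emits one '1' bit per child followed by a '0', and a virtual super-root contributes a leading '10', which gives exactly 2M + 1 bits for a trie with M nodes. The Python sketch below is a minimal illustration of this encoding over a plain dict-based trie; it is not the paper's implementation, which adds N-gram-specific compression on top of LOUDS.

    from collections import deque

    def louds_bits(root):
        """Encode a trie as a LOUDS bit string.

        Nodes are visited in level order; each node contributes one '1'
        per child followed by a '0'.  A virtual super-root contributes
        the leading '10', so a trie with M nodes yields 2M + 1 bits.
        """
        bits = "10"                         # virtual super-root -> real root
        queue = deque([root])
        while queue:
            node = queue.popleft()          # node: dict mapping label -> child
            children = sorted(node.items())
            bits += "1" * len(children) + "0"
            for _, child in children:
                queue.append(child)
        return bits

    # Example: the trie for the bigrams "a b" and "a c"
    # (root -> a -> {b, c}): 4 nodes, so 2*4 + 1 = 9 bits.
    trie = {"a": {"b": {}, "c": {}}}
    print(louds_bits(trie))                 # -> "101011000"

Navigation over this bit string (finding a node's first child, next sibling, or parent) is done with rank/select queries on the '0' and '1' bits, which is what makes the representation usable as a trie rather than just a compact dump.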
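The values attached to nodes (word IDs, probabilities, back-off weights) are handled separately from the tree shape. As a rough illustration of the two techniques named in the abstract, the sketch below combines a common variable-byte code (7 data bits per byte, high bit marking the final byte) with block-wise compression via zlib; the paper's exact codes and compressor may differ, and the block size here is purely illustrative.

    import zlib

    def vbyte_encode(n):
        """Variable-byte code: 7 data bits per byte, low-order group
        first; the high bit is set on the final byte."""
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte)            # continuation byte
            else:
                out.append(byte | 0x80)     # final byte
                return bytes(out)

    def compress_blocks(values, block_size=64):
        """Concatenate variable-byte-coded values into fixed-size blocks
        and compress each block independently, so that reading one value
        only requires decompressing its block."""
        blocks = []
        for i in range(0, len(values), block_size):
            raw = b"".join(vbyte_encode(v) for v in values[i:i + block_size])
            blocks.append(zlib.compress(raw))
        return blocks

    def read_value(blocks, index, block_size=64):
        """Random access: decompress the containing block, then decode
        values sequentially up to the requested offset."""
        raw = zlib.decompress(blocks[index // block_size])
        pos = value = 0
        for _ in range((index % block_size) + 1):
            value = shift = 0
            while True:
                byte = raw[pos]; pos += 1
                value |= (byte & 0x7F) << shift
                shift += 7
                if byte & 0x80:             # final byte of this value
                    break
        return value

    vals = [5, 300, 7, 42]
    blocks = compress_blocks(vals, block_size=2)
    print(read_value(blocks, 1, block_size=2))   # -> 300

The trade-off in the block size is the usual one: larger blocks compress better but cost more per random access, since the whole block must be decompressed and scanned to reach one value.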