Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling

  • Authors:
  • Daichi Mochihashi, Takeshi Yamada, Naonori Ueda

  • Affiliations:
  • NTT Communication Science Laboratories, Keihanna Science City, Kyoto, Japan

  • Venue:
  • ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1
  • Year:
  • 2009

Abstract

In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler, combined with dynamic programming, for inference. Our model is a nested hierarchical Pitman-Yor language model in which a character-level Pitman-Yor spelling model is embedded in the word model. We confirmed that it significantly outperforms previously reported results on both phonetic transcripts and standard datasets for Chinese and Japanese word segmentation. Our model can also be viewed as a way to construct an accurate word n-gram language model directly from the characters of an arbitrary language, without any "word" annotations.
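The inference scheme described in the abstract — blocked Gibbs sampling over whole-string segmentations via dynamic programming — is commonly realized as forward filtering followed by backward sampling over the segmentation lattice. The sketch below illustrates that mechanism on a toy unigram word model; `word_prob` is a hypothetical stand-in (a geometric length prior over uniform characters), not the paper's nested Pitman-Yor spelling model, and `max_len` caps candidate word lengths as in the paper's Viterbi-style recursion.

```python
import random

def word_prob(w, p_cont=0.5, alphabet=26):
    # Toy stand-in for the nested Pitman-Yor spelling model:
    # geometric prior on word length times uniform character probabilities.
    return (1 - p_cont) * p_cont ** (len(w) - 1) * (1.0 / alphabet) ** len(w)

def sample_segmentation(s, max_len=8, rng=random):
    n = len(s)
    # Forward filtering: alpha[t] is the total probability of generating
    # s[:t] under the unigram word model with a boundary at position t.
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for t in range(1, n + 1):
        for k in range(1, min(max_len, t) + 1):
            alpha[t] += alpha[t - k] * word_prob(s[t - k:t])
    # Backward sampling: starting from the end of the string, draw the
    # length of the final word proportionally to its contribution to alpha.
    words, t = [], n
    while t > 0:
        ks = list(range(1, min(max_len, t) + 1))
        weights = [alpha[t - k] * word_prob(s[t - k:t]) for k in ks]
        r = rng.random() * sum(weights)
        for k, wgt in zip(ks, weights):
            r -= wgt
            if r <= 0:
                break
        words.append(s[t - k:t])
        t -= k
    return words[::-1]
```

In the full blocked Gibbs sampler, each sentence's current segmentation would be removed from the language model's counts, a new segmentation drawn as above with word probabilities supplied by the hierarchical Pitman-Yor model, and the counts then restored.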