Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling

  • Authors:
  • Daichi Mochihashi, Takeshi Yamada, Naonori Ueda

  • Affiliations:
  • NTT Communication Science Laboratories, Keihanna Science City, Kyoto, Japan

  • Venue:
  • ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1
  • Year:
  • 2009

Abstract

In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler, combined with dynamic programming, for inference. Our model is a nested hierarchical Pitman-Yor language model in which a character-level Pitman-Yor spelling model is embedded in the word model. We confirmed that it significantly outperforms previously reported results on both phonetic transcripts and standard datasets for Chinese and Japanese word segmentation. Our model can also be viewed as a way to construct an accurate word n-gram language model directly from the characters of an arbitrary language, without any "word" annotations.
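The inference scheme described in the abstract — blocked Gibbs sampling over whole-string segmentations via dynamic programming — is commonly realized as forward filtering followed by backward sampling over the segmentation lattice. The sketch below illustrates that mechanism on a toy unigram word model; `word_prob` is a hypothetical stand-in (a geometric length prior over uniform characters), not the paper's nested Pitman-Yor spelling model, and `max_len` caps candidate word lengths as in the paper's Viterbi-style recursion.

```python
import random

def word_prob(w, p_cont=0.5, alphabet=26):
    # Toy stand-in for the nested Pitman-Yor spelling model:
    # geometric prior on word length times uniform character probabilities.
    return (1 - p_cont) * p_cont ** (len(w) - 1) * (1.0 / alphabet) ** len(w)

def sample_segmentation(s, max_len=8, rng=random):
    n = len(s)
    # Forward filtering: alpha[t] is the total probability of generating
    # s[:t] under the unigram word model with a boundary at position t.
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for t in range(1, n + 1):
        for k in range(1, min(max_len, t) + 1):
            alpha[t] += alpha[t - k] * word_prob(s[t - k:t])
    # Backward sampling: starting from the end of the string, draw the
    # length of the final word proportionally to its contribution to alpha.
    words, t = [], n
    while t > 0:
        ks = list(range(1, min(max_len, t) + 1))
        weights = [alpha[t - k] * word_prob(s[t - k:t]) for k in ks]
        r = rng.random() * sum(weights)
        for k, wgt in zip(ks, weights):
            r -= wgt
            if r <= 0:
                break
        words.append(s[t - k:t])
        t -= k
    return words[::-1]
```

In the full blocked Gibbs sampler, each sentence's current segmentation would be removed from the language model's counts, a new segmentation drawn as above with word probabilities supplied by the hierarchical Pitman-Yor model, and the counts then restored.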