An efficient algorithm for unsupervised word segmentation with branching entropy and MDL

Authors:
Valentin Zhikov;Hiroya Takamura;Manabu Okumura
Affiliations:
Tokyo Institute of Technology;Tokyo Institute of Technology;Tokyo Institute of Technology
Venue:
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Year:
2010

Citing 10
Cited 2

Suffix arrays: a new method for on-line string searches

SIAM Journal on Computing
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
The Penn Treebank: annotating predicate argument structure

HLT '94 Proceedings of the workshop on Human Language Technology
Contextual dependencies in unsupervised word segmentation

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Unsupervised segmentation of Chinese text by use of branching entropy

COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions
Nonparametric bayesian models of lexical acquisition

Nonparametric bayesian models of lexical acquisition
Training conditional random fields using incomplete annotations

COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Finding structure via compression

NeMLaP3/CoNLL '98 Proceedings of the Joint Conferences on New Methods in Language Processing and Computational Natural Language Learning
Improving nonparameteric Bayesian inference: experiments on unsupervised word segmentation with adaptor grammars

NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1

Fully unsupervised word segmentation with BVE and MDL

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
A regularized compression method to unsupervised word segmentation

SIGMORPHON '12 Proceedings of the Twelfth Meeting of the Special Interest Group on Computational Morphology and Phonology

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper proposes a fast and simple unsupervised word segmentation algorithm that utilizes the local predictability of adjacent character sequences, while searching for a least-effort representation of the data. The model uses branching entropy as a means of constraining the hypothesis space, in order to efficiently obtain a solution that minimizes the length of a two-part MDL code. An evaluation with corpora in Japanese, Thai, English, and the "CHILDES" corpus for research in language development reveals that the algorithm achieves an accuracy, comparable to that of the state-of-the-art methods in unsupervised word segmentation, in a significantly reduced computational time.