A regularized compression method to unsupervised word segmentation

Authors:
Ruey-Cheng Chen; Chiung-Min Tsai; Jieh Hsiang
Affiliations:
National Taiwan University, Taipei, Taiwan; National Taiwan University, Taipei, Taiwan; National Taiwan University, Taipei, Taiwan
Venue:
SIGMORPHON '12 Proceedings of the Twelfth Meeting of the Special Interest Group on Computational Morphology and Phonology
Year:
2012

Abstract

Languages are constantly evolving through their users due to the need to communicate more efficiently. Under this hypothesis, we formulate unsupervised word segmentation as a regularized compression process. We reduce this process to an optimization problem and propose a greedy inclusion solution. Preliminary test results on the Bernstein-Ratner corpus and Bakeoff-2005 show that our method is comparable to the state of the art in terms of effectiveness and efficiency.
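
The abstract does not spell out the objective or the search procedure, so the following is only a minimal sketch of what a greedy, regularized compression loop might look like, not the authors' method. Everything in it is an assumption made for illustration: the objective (unigram code length of the token sequence plus a penalty `lam` on the number of characters in the lexicon), the merge rule (joining every occurrence of an adjacent token pair), and the names `code_length`, `objective`, `merge`, and `greedy_segment`.

```python
import math
from collections import Counter


def code_length(tokens):
    """Bits needed to encode the tokens under their own unigram distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c * math.log2(c / n) for c in counts.values())


def objective(tokens, lam):
    """Assumed objective: sequence code length plus a lexicon-size penalty."""
    lexicon = set(tokens)
    # Each lexicon entry costs its length plus one (a hypothetical delimiter).
    return code_length(tokens) + lam * sum(len(w) + 1 for w in lexicon)


def merge(tokens, pair):
    """Join every non-overlapping occurrence of the adjacent pair, left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def greedy_segment(text, lam=1.0, max_steps=100):
    """Start from single characters and greedily include the best merge."""
    tokens = list(text)
    best = objective(tokens, lam)
    for _ in range(max_steps):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Score every candidate merge and keep the one with the lowest objective.
        score, pair = min((objective(merge(tokens, p), lam), p) for p in pairs)
        if score >= best:
            break  # no merge improves the assumed objective
        tokens, best = merge(tokens, pair), score
    return tokens


if __name__ == "__main__":
    print(greedy_segment("thedogthecatthedogthecat", lam=1.0))
```

In this sketch the penalty weight `lam` plays the role of the regularizer: a larger value discourages adding long entries to the lexicon, while a value of zero leaves only the compression term. The loop rescans all adjacent pairs at every step, which is simple but not efficient; it is meant to show the shape of a greedy inclusion procedure rather than to match the efficiency the abstract reports.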