Unsupervized word segmentation: the case for Mandarin Chinese

Authors:
Pierre Magistry;Benoît Sagot
Affiliations:
INRIA & Univ. Paris, Paris, France;INRIA & Univ. Paris, Paris, France
Venue:
ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2
Year:
2012

Citing 8
Cited 0

An Unsupervised Algorithm for Segmenting Categorical Timeseries into Episodes

Proceedings of the ESF Exploratory Workshop on Pattern Detection and Discovery
Accessor variety criteria for Chinese word extraction

Computational Linguistics
Contextual dependencies in unsupervised word segmentation

ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Unsupervised segmentation of Chinese text by use of branching entropy

COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions
Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling

ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1
A fast decoder for joint word segmentation and POS-tagging using a single discriminative model

EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
A new unsupervised approach to word segmentation

Computational Linguistics
Entropy as an indicator of context boundaries: an experiment using a web search engine

IJCNLP'05 Proceedings of the Second international joint conference on Natural Language Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper, we present an unsupervized segmentation system tested on Mandarin Chinese. Following Harris's Hypothesis in Kempe (1999) and Tanaka-Ishii's (2005) reformulation, we base our work on the Variation of Branching Entropy. We improve on (Jin and Tanaka-Ishii, 2006) by adding normalization and viterbi-decoding. This enable us to remove most of the thresholds and parameters from their model and to reach near state-of-the-art results (Wang et al., 2011) with a simpler system. We provide evaluation on different corpora available from the Segmentation bake-off II (Emerson, 2005) and define a more precise topline for the task using cross-trained supervized system available off-the-shelf (Zhang and Clark, 2010; Zhao and Kit, 2008; Huang and Zhao, 2007)