Unsupervised segmentation of Chinese text by use of branching entropy

Authors:
Zhihui Jin; Kumiko Tanaka-Ishii
Affiliations:
University of Tokyo; University of Tokyo
Venue:
COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions
Year:
2006

Abstract

We propose an unsupervised segmentation method based on an assumption about language data: that a point at which the entropy of successive characters increases is the location of a word boundary. In a large-scale experiment using 200 MB of unsegmented training data and 1 MB of test data, the method attained a precision of 90% with recall of around 80%. Moreover, we found that precision remained stable at around 90% regardless of the size of the training data.
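
The assumption stated in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`train_successors`, `branching_entropy`, `segment`), the fixed context length `n`, the threshold `delta`, and the treatment of unseen contexts as zero entropy are all assumptions made here for illustration. The sketch counts which character follows each n-character context in unsegmented text, computes the entropy of that next-character distribution, and places a boundary wherever the entropy rises from one position to the next.

```python
import math
from collections import Counter, defaultdict


def train_successors(corpus, n):
    """Count which character follows each n-character context in the corpus."""
    successors = defaultdict(Counter)
    for i in range(len(corpus) - n):
        successors[corpus[i:i + n]][corpus[i + n]] += 1
    return successors


def branching_entropy(successors, context):
    """Entropy in bits of the next-character distribution after `context`."""
    counts = successors.get(context)
    if not counts:
        # Assumption of this sketch: an unseen context carries no evidence.
        return 0.0
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def segment(text, successors, n, delta=0.0):
    """Split `text` wherever the entropy rises by more than `delta`."""
    cuts = []
    prev = None
    for end in range(n, len(text)):
        # Entropy of the character at position `end`, given the n before it.
        h = branching_entropy(successors, text[end - n:end])
        if prev is not None and h - prev > delta:
            cuts.append(end)  # entropy increased: hypothesise a boundary here
        prev = h
    pieces, start = [], 0
    for cut in cuts:
        pieces.append(text[start:cut])
        start = cut
    pieces.append(text[start:])
    return pieces
```

Comparing contexts of a fixed length at consecutive positions is a simplification chosen to keep the sketch short. Raising `delta` would be expected to yield fewer boundaries, which is one hypothetical way to trade recall for precision in a sketch like this.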