Unsupervised training for overlapping ambiguity resolution in Chinese word segmentation

Authors:
Mu Li;Jianfeng Gao;Changning Huang;Jianfeng Li
Affiliations:
Microsoft Research, Asia, Beijing, China;Microsoft Research, Asia, Beijing, China;Microsoft Research, Asia, Beijing, China;University of Science and Technology of China, Hefei, China
Venue:
SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
Year:
2003

Citing 3
Cited 8

Toward a unified approach to statistical language modeling for Chinese

ACM Transactions on Asian Language Information Processing (TALIP)
A simple approach to building ensembles of Naive Bayesian classifiers for word sense disambiguation

NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
Classifier combination for improved lexical disambiguation

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1

Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach

Computational Linguistics
Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation

TSD '08 Proceedings of the 11th international conference on Text, Speech and Dialogue
A Unified Character-Based Tagging Framework for Chinese Word Segmentation

ACM Transactions on Asian Language Information Processing (TALIP)
An example-based study on chinese word segmentation using critical fragments

IJCNLP'04 Proceedings of the First international joint conference on Natural Language Processing
Unsupervised overlapping feature selection for conditional random fields learning in Chinese word segmentation

ROCLING '11 Proceedings of the 23rd Conference on Computational Linguistics and Speech Processing
The Left and Right Context of a Word: Overlapping Chinese Syllable Word Segmentation with Minimal Context

ACM Transactions on Asian Language Information Processing (TALIP)
Revising word lattice using support vector machine for Chinese word segmentation

Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services
The application of kalman filter based human-computer learning model to chinese word segmentation

CICLing'13 Proceedings of the 14th international conference on Computational Linguistics and Intelligent Text Processing - Volume Part I

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper proposes an unsupervised training approach to resolving overlapping ambiguities in Chinese word segmentation. We present an ensemble of adapted Naïve Bayesian classifiers that can be trained using an unlabelled Chinese text corpus. These classifiers differ in that they use context words within windows of different sizes as features. The performance of our approach is evaluated on a manually annotated test set. Experimental results show that the proposed approach achieves an accuracy of 94.3%, rivaling the rule-based and supervised training methods.