Unsupervised training for overlapping ambiguity resolution in Chinese word segmentation

  • Authors:
  • Mu Li; Jianfeng Gao; Changning Huang; Jianfeng Li

  • Affiliations:
  • Microsoft Research, Asia, Beijing, China; Microsoft Research, Asia, Beijing, China; Microsoft Research, Asia, Beijing, China; University of Science and Technology of China, Hefei, China

  • Venue:
  • SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
  • Year:
  • 2003

Abstract

This paper proposes an unsupervised training approach to resolving overlapping ambiguities in Chinese word segmentation. We present an ensemble of adapted Naïve Bayesian classifiers that can be trained using an unlabelled Chinese text corpus. These classifiers differ in that they use context words within windows of different sizes as features. The performance of our approach is evaluated on a manually annotated test set. Experimental results show that the proposed approach achieves an accuracy of 94.3%, rivaling the rule-based and supervised training methods.
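
The general idea of an ensemble of Naïve Bayesian classifiers that differ only in context window size can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the names `context_features`, `WindowedNaiveBayes`, and `ensemble_predict`, the add-one smoothing, the combination of classifiers by summed log scores, the labels `"AB|C"` and `"A|BC"` for the two candidate segmentations of an overlapping ambiguity string, and the toy placeholder tokens. The sketch also fits its counts from a handful of labelled toy examples for brevity, so it does not reproduce the paper's adaptation for training on unlabelled text.

```python
import math
from collections import Counter, defaultdict


def context_features(left, right, window):
    """Return up to `window` context words on each side (window >= 1)."""
    return left[-window:] + right[:window]


class WindowedNaiveBayes:
    """Naive Bayes over context words inside a fixed-size window (illustrative)."""

    def __init__(self, window, alpha=1.0):
        self.window = window
        self.alpha = alpha                       # add-alpha smoothing (assumption)
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)  # label -> context word counts
        self.vocab = set()

    def fit(self, examples):
        # Each example is (left_context, right_context, label).
        for left, right, label in examples:
            self.label_counts[label] += 1
            for word in context_features(left, right, self.window):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def log_scores(self, left, right):
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocab) + 1         # one extra slot for unseen words
        scores = {}
        for label, count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(count / total)      # log prior
            for word in context_features(left, right, self.window):
                score += math.log((counts[word] + self.alpha) / denom)
            scores[label] = score
        return scores


def ensemble_predict(classifiers, left, right):
    """Sum log scores across classifiers and return the best label."""
    combined = defaultdict(float)
    for clf in classifiers:
        for label, score in clf.log_scores(left, right).items():
            combined[label] += score
    return max(combined, key=combined.get)


# Toy data: placeholder tokens stand in for context words around an ambiguity.
examples = [
    (["x", "a"], ["b", "y"], "AB|C"),
    (["x", "a"], ["b", "z"], "AB|C"),
    (["p", "c"], ["d", "q"], "A|BC"),
    (["r", "c"], ["d", "q"], "A|BC"),
]
classifiers = [WindowedNaiveBayes(window).fit(examples) for window in (1, 2)]
print(ensemble_predict(classifiers, ["x", "a"], ["b", "y"]))
```

Summing log scores is only one way to combine the members; majority voting over the individual predictions would be an equally plausible choice for a sketch like this.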