Self-Supervised Chinese Word Segmentation

  • Authors:
  • Fuchun Peng; Dale Schuurmans

  • Venue:
  • IDA '01 Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis
  • Year:
  • 2001

Abstract

We propose a new unsupervised training method for acquiring probability models that accurately segment Chinese character sequences into words. By constructing a core lexicon to guide unsupervised word learning, self-supervised segmentation overcomes the local maxima problems that hamper standard EM training. Our procedure uses successive EM phases to learn a good probability model over character strings, and then prunes this model with a mutual information selection criterion to obtain a more accurate word lexicon. The segmentations produced by these models are more accurate than those produced by training with EM alone.
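The abstract names two ingredients, a probability model over character strings and a mutual information criterion for pruning the lexicon, without giving their details. The following is a minimal sketch of how such pieces might fit together, not the authors' implementation. It assumes a unigram word model, a maximum word length, and a pruning rule that drops a multi-character entry when the pointwise mutual information of any two-way split falls at or below a threshold. The function names `segment` and `prune_by_mutual_information`, the threshold, and the toy lexicon (Latin letters standing in for Chinese characters) are all hypothetical, and the EM re-estimation phases are omitted.

```python
import math


def segment(text, probs, max_len=4):
    """Viterbi search for the most probable split of `text` into lexicon words.

    Assumes a unigram model: the score of a segmentation is the product of
    the probabilities of its words.
    """
    n = len(text)
    # best[i] = (log-probability of the best split of text[:i], start of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be covered by the lexicon")
    # Follow the back-pointers from the end of the text to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


def prune_by_mutual_information(probs, threshold=0.0):
    """Drop entries that are not more probable than their parts, then renormalise.

    Assumed criterion: for a split word = a + b with both parts in the lexicon,
    compute log(p(word) / (p(a) * p(b))); prune if any split scores <= threshold.
    Single characters have no split and are always kept.
    """
    kept = {}
    for word, p in probs.items():
        scores = [
            math.log(p / (probs[word[:i]] * probs[word[i:]]))
            for i in range(1, len(word))
            if word[:i] in probs and word[i:] in probs
        ]
        if scores and min(scores) <= threshold:
            continue
        kept[word] = p
    total = sum(kept.values())
    return {w: p / total for w, p in kept.items()}


# Toy lexicon (hypothetical values).
toy_probs = {"a": 0.3, "b": 0.3, "c": 0.2, "ab": 0.15, "bc": 0.05}
pruned = prune_by_mutual_information(toy_probs)
print(segment("abc", pruned))
```

In this toy lexicon, "bc" is less probable than the product of its parts and is pruned, while "ab" is more probable than the product of its parts and is kept. In a full pipeline, the probabilities would come from EM training rather than being set by hand.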