Enhancing Chinese word segmentation using unlabeled data

  • Authors:
  • Weiwei Sun; Jia Xu

  • Affiliations:
  • Saarland University and German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany; German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany

  • Venue:
  • EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
  • Year:
  • 2011

Abstract

This paper investigates how unlabeled data can improve the accuracy of supervised word segmentation. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution for incorporating features derived from unlabeled data into a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea of transductive, document-level segmentation, which is designed to improve system recall for out-of-vocabulary (OOV) words that appear more than once within a document. The novel features yield relative error reductions of 13.8% in F-score and 15.4% in OOV recall.
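The abstract does not say which string statistics or document-level features are used, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes pointwise mutual information over adjacent characters as one possible corpus-level statistic, and repeated substrings within a single document as one possible document-level cue for recurring OOV words. All function names and parameters here are hypothetical.

```python
import math
from collections import Counter


def char_ngram_counts(lines, max_n=2):
    """Count character n-grams (n = 1..max_n) over unsegmented text lines."""
    counts = Counter()
    for line in lines:
        for n in range(1, max_n + 1):
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
    return counts


def bigram_pmi(counts, left, right):
    """Pointwise mutual information of two adjacent characters.

    Assumed statistic for illustration only; returns None for unseen events.
    """
    total_uni = sum(c for g, c in counts.items() if len(g) == 1)
    total_bi = sum(c for g, c in counts.items() if len(g) == 2)
    pair = counts.get(left + right, 0)
    if pair == 0 or counts.get(left, 0) == 0 or counts.get(right, 0) == 0:
        return None
    p_pair = pair / total_bi
    p_left = counts[left] / total_uni
    p_right = counts[right] / total_uni
    return math.log2(p_pair / (p_left * p_right))


def recurring_strings(document, min_len=2, max_len=4, min_count=2):
    """Substrings that occur at least min_count times within one document."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(document) - n + 1):
            counts[document[i:i + n]] += 1
    return {s: c for s, c in counts.items() if c >= min_count}
```

In a character-based segmenter, values like these would typically be binned and attached as extra features to each character position. How the paper actually encodes them is not stated in the abstract.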