Subword-based tagging by conditional random fields for Chinese word segmentation

  • Authors:
  • Ruiqiang Zhang; Genichiro Kikui; Eiichiro Sumita

  • Affiliations:
  • National Institute of Information and Communications Technology and ATR Spoken Language Communication Research Laboratories, Soraku-gun, Kyoto, Japan; ATR Spoken Language Communication Research Laboratories, Soraku-gun, Kyoto, Japan; National Institute of Information and Communications Technology and ATR Spoken Language Communication Research Laboratories, Soraku-gun, Kyoto, Japan

  • Venue:
  • NAACL-Short '06 Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers
  • Year:
  • 2006

Abstract

We propose two approaches to improving Chinese word segmentation: subword-based tagging and a confidence measure. Subword-based tagging outperformed the existing character-based tagging, and the confidence measure improved segmentation further by combining the subword-based tagging results with those of a dictionary-based segmentation. The confidence measure can also be used to balance out-of-vocabulary rates and in-vocabulary rates. With these techniques we achieved higher F-scores on the CITYU, PKU and MSR corpora than the best results from the SIGHAN Bakeoff 2005.
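
As a rough illustration of what tagging over subwords rather than single characters could look like, the sketch below splits a sentence into subword units and then merges the units back into words from per-unit tags. The abstract does not specify the details, so everything here is an assumption made for illustration: the greedy longest-match splitting, the two-tag B/I scheme, the tiny lexicon, and the function names `to_subwords` and `tags_to_words`. The conditional random field that would predict the tags is omitted, and the tags are written by hand.

```python
def to_subwords(text, lexicon, max_len=4):
    """Split text into subword units by greedy forward longest match.

    Assumption for this sketch: a unit is either a multi-character entry
    of `lexicon` or, when nothing longer matches, a single character.
    """
    units, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                units.append(piece)
                i += n
                break
    return units


def tags_to_words(units, tags):
    """Merge units into words: 'B' starts a new word, 'I' continues it."""
    words = []
    for unit, tag in zip(units, tags):
        if tag == "B" or not words:
            words.append(unit)
        else:
            words[-1] += unit
    return words


# Hypothetical subword lexicon, not taken from the paper.
lexicon = {"北京", "大学"}
units = to_subwords("北京大学生", lexicon)

# Hand-written tags standing in for the output of a trained tagger.
words = tags_to_words(units, ["B", "I", "B"])
```

In this toy case the tagger would label three subword units instead of five characters, which is the general idea behind working at the subword level.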