Subword-based tagging for confidence-dependent Chinese word segmentation

  • Authors: Ruiqiang Zhang; Genichiro Kikui; Eiichiro Sumita

  • Affiliations: National Institute of Information and Communications Technology and ATR Spoken Language Communication Research Laboratories, Kyoto, Japan; NTT; National Institute of Information and Communications Technology and ATR Spoken Language Communication Research Laboratories, Kyoto, Japan

  • Venue: COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions
  • Year: 2006

Abstract

We proposed subword-based tagging for Chinese word segmentation as an improvement over the existing character-based tagging. The subword-based tagging was implemented using both maximum entropy (MaxEnt) and conditional random field (CRF) models. We found that the proposed subword-based tagging outperformed character-based tagging in all comparative experiments. In addition, we proposed a confidence measure approach for combining the results of dictionary-based segmentation and subword-tagging-based segmentation. This approach can produce an ideal tradeoff between the in-vocabulary rate and the out-of-vocabulary rate. Our techniques were evaluated on the test data from the SIGHAN Bakeoff 2005. We achieved higher F-scores than the best results on three of the four corpora: PKU (0.951), CITYU (0.950), and MSR (0.971).
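
To make the idea of a confidence-based combination concrete, here is a minimal Python sketch. It is not the authors' method: the abstract does not give the confidence formula, so the simple threshold rule, the B/I tag set, and the names `combine_tags`, `tags_to_words`, and `threshold` are all assumptions chosen for illustration. Each segmentation unit (a character or subword) carries one tag from a dictionary-based segmenter and one from a statistical tagger with a probability; the tagger's tag is kept only when its probability reaches the threshold.

```python
from typing import List


def combine_tags(dict_tags: List[str], tagger_tags: List[str],
                 tagger_probs: List[float], threshold: float) -> List[str]:
    """Pick one tag per unit (hypothetical rule, not the paper's formula).

    Keep the statistical tagger's tag when its probability is at least
    `threshold`; otherwise fall back to the dictionary-based tag.
    """
    if not (len(dict_tags) == len(tagger_tags) == len(tagger_probs)):
        raise ValueError("all inputs must have one entry per unit")
    return [t if p >= threshold else d
            for d, t, p in zip(dict_tags, tagger_tags, tagger_probs)]


def tags_to_words(units: List[str], tags: List[str]) -> List[str]:
    """Decode tags into words: 'B' starts a word, 'I' extends the previous one."""
    words: List[str] = []
    for unit, tag in zip(units, tags):
        if tag == "I" and words:
            words[-1] += unit      # continue the current word
        else:
            words.append(unit)     # start a new word (also covers a leading 'I')
    return words


if __name__ == "__main__":
    # Illustrative units and scores only; these are not data from the paper.
    units = ["北京", "大学", "生"]
    dict_tags = ["B", "B", "B"]
    tagger_tags = ["B", "I", "B"]
    tagger_probs = [0.9, 0.4, 0.8]
    for threshold in (0.5, 0.3):
        tags = combine_tags(dict_tags, tagger_tags, tagger_probs, threshold)
        print(threshold, tags_to_words(units, tags))
```

In this sketch, raising the threshold shifts decisions toward the dictionary, which tends to favor known words, while lowering it trusts the tagger more, which gives it room to join units into words the dictionary lacks. That is one plausible reading of the in-vocabulary versus out-of-vocabulary tradeoff the abstract mentions.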