A discriminative latent variable Chinese segmenter with hybrid word/character information

  • Authors:
  • Xu Sun; Yaozhong Zhang; Takuya Matsuzaki; Yoshimasa Tsuruoka; Jun'ichi Tsujii

  • Affiliations:
  • University of Tokyo; University of Tokyo; University of Tokyo; University of Manchester; University of Tokyo, Japan and University of Manchester, UK

  • Venue:
  • NAACL '09 Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics
  • Year:
  • 2009

Abstract

Conventional approaches to Chinese word segmentation treat the problem as a character-based tagging task. Recently, semi-Markov models have been applied to the problem, incorporating features based on complete words. In this paper, we propose an alternative, a latent variable model, which uses hybrid information based on both word sequences and character sequences. We argue that the use of latent variables can help capture long-range dependencies and improve recall on long words, e.g., named entities. Experimental results show that this is indeed the case. With this improvement, evaluations on the data of the second SIGHAN CWS bakeoff show that our system is competitive with the best systems reported in the literature.
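
The character-based tagging formulation mentioned in the abstract can be illustrated with a minimal sketch. The four-tag B/M/E/S scheme (begin, middle, end, single) and the function names below are assumptions chosen for illustration; the abstract does not state which tag set or implementation the authors use.

```python
def words_to_tags(words):
    """Map a segmented sentence (list of words) to one tag per character.

    Assumed tag set (not taken from the paper): B = first character of a
    multi-character word, M = middle, E = last, S = single-character word.
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Recover a word sequence from characters and their predicted tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word boundary follows this character
            words.append(current)
            current = ""
    if current:  # flush an unfinished word left by an ill-formed tag sequence
        words.append(current)
    return words


segmented = ["我", "喜欢", "自然语言"]
tags = words_to_tags(segmented)
restored = tags_to_words("".join(segmented), tags)
```

Under this view a segmenter only has to predict one tag per character, which is why word-level information (as in semi-Markov models or the hybrid features proposed here) has to be added separately.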