Integrating ngram model and case-based learning for Chinese word segmentation

  • Authors:
  • Chunyu Kit; Zhiming Xu; Jonathan J. Webster

  • Affiliations:
  • City University of Hong Kong, Kowloon, Hong Kong;City University of Hong Kong, Kowloon, Hong Kong;City University of Hong Kong, Kowloon, Hong Kong

  • Venue:
  • SIGHAN '03: Proceedings of the Second SIGHAN Workshop on Chinese Language Processing - Volume 17
  • Year:
  • 2003

Abstract

This paper presents our recent work for participation in the First International Chinese Word Segmentation Bake-off (ICWSB-1). Our system is based on a general-purpose ngram model for word segmentation and a case-based learning approach to disambiguation. It excels at identifying in-vocabulary (IV) words, achieving a recall of around 96-98%. Here we present our strategies for language model training and disambiguation rule learning, analyze the system's performance, and discuss areas for further improvement, e.g., out-of-vocabulary (OOV) word discovery.
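To illustrate the general idea of ngram-based segmentation, the following is a minimal sketch and not the authors' system. It assumes the simplest case, a unigram model (n = 1), and picks the segmentation with the highest total log-probability by dynamic programming. The function name `segment`, the parameters `max_len` and `oov_logprob`, and the toy probabilities in `toy_logprob` are all hypothetical, chosen only for this example; the case-based disambiguation step is not modelled here.

```python
import math


def segment(text, word_logprob, max_len=4, oov_logprob=-20.0):
    """Return the highest-scoring segmentation of `text` under a unigram model.

    word_logprob maps known words to log-probabilities (illustrative values).
    Unknown single characters receive the flat penalty `oov_logprob`.
    """
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in word_logprob:
                score = word_logprob[piece]
            elif len(piece) == 1:
                score = oov_logprob  # fallback so every string is segmentable
            else:
                continue             # unknown multi-character strings are skipped
            if best[j] + score > best[i]:
                best[i] = best[j] + score
                back[i] = j
    # Follow the back-pointers from the end of the string to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Toy lexicon with made-up probabilities (assumption, not taken from the paper).
toy_logprob = {
    "研究": math.log(0.02),
    "研究生": math.log(0.005),
    "生命": math.log(0.01),
    "命": math.log(0.001),
    "的": math.log(0.05),
    "起源": math.log(0.01),
}

result = segment("研究生命的起源", toy_logprob)
print(result)  # ['研究', '生命', '的', '起源']
```

In this toy case the string could also be read as 研究生 / 命 / 的 / 起源; the unigram scores prefer 研究 / 生命 / 的 / 起源. A model of this kind can only propose multi-character words it has already seen, which is consistent with the abstract's point that OOV word discovery remains a separate problem.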