Integrating ngram model and case-based learning for Chinese word segmentation

Authors:
Chunyu Kit;Zhiming Xu;Jonathan J. Webster
Affiliations:
City University of Hong Kong, Kowloon, Hong Kong;City University of Hong Kong, Kowloon, Hong Kong;City University of Hong Kong, Kowloon, Hong Kong
Venue:
SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
Year:
2003

Citing 3
Cited 1

A corpus-based approach to language learning

A corpus-based approach to language learning
A trainable rule-based algorithm for word segmentation

ACL '98 Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
Learning case-based knowledge for disambiguating Chinese word segmentation: a preliminary study

SIGHAN '02 Proceedings of the first SIGHAN workshop on Chinese language processing - Volume 18

An integrative Chinese lexical analyzer based on maximum matching and second-maximum matching segmentation

ICCOMP'06 Proceedings of the 10th WSEAS international conference on Computers

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper presents our recent work for participation in the First International Chinese Word Segmentation Bake-off (ICWSB-1). It is based on a general-purpose ngram model for word segmentation and a case-based learning approach to disambiguation. This system excels in identifying in-vocabulary (IV) words, achieving a recall of around 96-98%. Here we present our strategies for language model training and disambiguation rule learning, analyze the system's performance, and discuss areas for further improvement, e.g., out-of-vocabulary (OOV) word discovery.