Chinese new word finding using character-based parsing model

Authors:
Yao Meng;Hao Yu;Fumihito Nishino
Affiliations:
FUJITSU R&D Center Co., Ltd, District Beijing, P.R.China;FUJITSU R&D Center Co., Ltd, District Beijing, P.R.China;FUJITSU R&D Center Co., Ltd, District Beijing, P.R.China
Venue:
IJCNLP'04 Proceedings of the First international joint conference on Natural Language Processing
Year:
2004

Abstract

New word finding is a difficult yet indispensable task in Chinese word segmentation. Traditional methods use string statistics to identify new words in large-scale corpora. Such statistics, however, are neither convenient nor powerful enough to describe the laws governing the internal and external structure of words, and they are even less effective when new words occur very rarely in the corpus. In this paper, we present a novel method that uses parsing information to find new words. A character-level PCFG model is trained on the People's Daily corpus and the Penn Chinese Treebank. The characters of a sentence are fed into the character-level parser, and the words are determined automatically from the resulting parse tree. Our method describes word-building rules within full sentences and takes advantage of the rich context to find new words. It is especially effective at identifying occasional or rarely used words, which usually have low frequency. Preliminary experiments indicate that our method can substantially improve the precision and recall of new word finding.
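
The idea of reading words off a character-level parse tree can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy grammar, its probabilities, the nonterminal names (`S` for sentence, `W` for word, `B` and `E` for word-initial and word-final characters), and the helper names `viterbi_parse`, `segment`, and `find_new_words`. A plain CKY Viterbi search stands in for whatever parser the authors used, and the real model would be estimated from treebank data rather than written by hand.

```python
import math

# Hypothetical toy PCFG in Chomsky normal form (not the authors' grammar).
# Binary rules: (parent, left child, right child, probability).
BINARY_RULES = [
    ("S", "W", "S", 0.6),
    ("S", "W", "W", 0.4),
    ("W", "B", "E", 0.4),   # a two-character word
]
# Lexical rules: (label, character, probability).
LEXICAL_RULES = [
    ("W", "我", 0.25),
    ("W", "爱", 0.25),
    ("W", "北", 0.05),
    ("W", "京", 0.05),
    ("B", "北", 1.0),
    ("E", "京", 1.0),
]


def viterbi_parse(chars):
    """CKY chart: best[(i, j)][label] = (log probability, backpointer)."""
    n = len(chars)
    best = {}
    for i, ch in enumerate(chars):
        best[(i, i + 1)] = {
            label: (math.log(p), ch)
            for label, c, p in LEXICAL_RULES
            if c == ch
        }
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):
                left, right = best[(i, k)], best[(k, j)]
                for parent, l, r, p in BINARY_RULES:
                    if l in left and r in right:
                        score = math.log(p) + left[l][0] + right[r][0]
                        if parent not in cell or score > cell[parent][0]:
                            cell[parent] = (score, (k, l, r))
            best[(i, j)] = cell
    return best


def extract_words(chars, best, label, i, j):
    """Walk the best tree and emit the span under each W node as one word."""
    if label == "W":
        return ["".join(chars[i:j])]
    k, l, r = best[(i, j)][label][1]
    return extract_words(chars, best, l, i, k) + extract_words(chars, best, r, k, j)


def segment(sentence):
    """Return the word sequence of the most probable parse, or None."""
    chars = list(sentence)
    if not chars:
        return None
    best = viterbi_parse(chars)
    if "S" not in best[(0, len(chars))]:
        return None
    return extract_words(chars, best, "S", 0, len(chars))


def find_new_words(sentence, lexicon):
    """Words proposed by the parse that are absent from a known lexicon."""
    return [w for w in (segment(sentence) or []) if w not in lexicon]


print(segment("我爱北京"))
print(find_new_words("我爱北京", {"我", "爱"}))
```

In this toy setting the parse that groups 北 and 京 under one `W` node outscores the parse that treats them as two single-character words, so 北京 is proposed as a word even though it is missing from the lexicon. The decision comes from the sentence-level parse rather than from how often the string occurs, which is the property the abstract emphasises for low-frequency words.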