Word extraction based on semantic constraints in Chinese word-formation

Authors:
Maosong Sun;Shengfen Luo;Benjamin K T'sou
Affiliations:
National Lab. of Intelligent Tech. & Systems, Tsinghua University, Beijing, China;National Lab. of Intelligent Tech. & Systems, Tsinghua University, Beijing, China;Language Information Sciences Research Centre, City University of Hong Kong
Venue:
CICLing'05 Proceedings of the 6th international conference on Computational Linguistics and Intelligent Text Processing
Year:
2005

Abstract

This paper presents a novel approach to Chinese word extraction based on the semantic information of characters. A thesaurus of Chinese characters is constructed. A Chinese lexicon of 63,738 two-character words, together with the character thesaurus, is used to learn semantic constraints between characters in Chinese word-formation, yielding a semantic-tag-based HMM. The Baum-Welch re-estimation scheme is then used to train the parameters of the HMM in an unsupervised manner. Various statistical measures for estimating the likelihood that a character string is a word are also tested. Large-scale experiments show promising results: the F-score of this word extraction method can reach 68.5%, whereas its counterpart, the character-based mutual information method, can reach only 47.5%.
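To make the two kinds of scoring named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes pointwise mutual information over adjacent characters as the character-based baseline, and it assumes that a two-character candidate is scored under a semantic-tag HMM by summing over all tag pairs. The function names, the dictionary layout of the parameters, and the tag labels `T1` and `T2` are hypothetical; in the paper the parameters would come from the lexicon, the character thesaurus, and Baum-Welch training.

```python
import math
from collections import Counter


def char_mutual_information(corpus, c1, c2):
    """Pointwise mutual information (in bits) of the adjacent pair c1 c2.

    `corpus` is a plain string of characters. Returns -inf when the pair
    never occurs adjacently in the corpus.
    """
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    if bigrams[(c1, c2)] == 0:
        return float("-inf")
    p_xy = bigrams[(c1, c2)] / n_bi
    p_x = unigrams[c1] / n_uni
    p_y = unigrams[c2] / n_uni
    return math.log2(p_xy / (p_x * p_y))


def tag_hmm_score(c1, c2, start, trans, emit):
    """Probability of the string c1 c2 under a hypothetical tag-based HMM.

    start[t]      : probability that the first character has semantic tag t
    trans[t1][t2] : probability that tag t2 follows tag t1 inside a word
    emit[t][c]    : probability that tag t emits character c
    The score sums over every (t1, t2) tag pair, because a character may
    belong to more than one semantic class.
    """
    total = 0.0
    for t1, p_t1 in start.items():
        e1 = emit.get(t1, {}).get(c1, 0.0)
        if e1 == 0.0:
            continue
        for t2, p_t2 in trans.get(t1, {}).items():
            e2 = emit.get(t2, {}).get(c2, 0.0)
            total += p_t1 * e1 * p_t2 * e2
    return total


# Toy usage with made-up data (not taken from the paper).
toy_corpus = "abababcc"
toy_start = {"T1": 1.0}
toy_trans = {"T1": {"T2": 0.5}}
toy_emit = {"T1": {"x": 0.5}, "T2": {"y": 0.4}}

mi_ab = char_mutual_information(toy_corpus, "a", "b")
hmm_xy = tag_hmm_score("x", "y", toy_start, toy_trans, toy_emit)
```

In this sketch a candidate string would be accepted as a word when its score exceeds a threshold. The intuition behind the tag-based score is that tag-level statistics can generalize to character pairs that are rare or unseen in the corpus, where character-level mutual information has little evidence to work with.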