This paper presents a novel approach to Chinese word extraction based on the semantic information of characters. A thesaurus of Chinese characters is constructed. A Chinese lexicon of 63,738 two-character words is used together with this thesaurus to learn the semantic constraints between characters in Chinese word formation, yielding a semantic-tag-based HMM. The parameters of the HMM are then trained in an unsupervised manner with the Baum-Welch re-estimation scheme. Various statistical measures for estimating the likelihood that a character string is a word are also tested. Large-scale experiments show promising results: the F-score of this word extraction method reaches 68.5%, whereas its counterpart, the character-based mutual information method, reaches only 47.5%.
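For readers unfamiliar with the baseline and the evaluation metric mentioned above, the following is a minimal sketch of a character-based mutual information score for two-character strings and of the F-score. The abstract does not give the exact formulas used, so this assumes the standard pointwise mutual information estimated from raw corpus counts and the balanced F-score over sets of words; the function names `bigram_pmi` and `f_score` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def bigram_pmi(corpus):
    """Score every adjacent character pair in `corpus` (a list of strings)
    by pointwise mutual information: log2(P(xy) / (P(x) * P(y))).
    Assumption: probabilities are plain relative frequencies."""
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        pair_counts.update(sentence[i:i + 2] for i in range(len(sentence) - 1))

    n_chars = sum(char_counts.values())
    n_pairs = sum(pair_counts.values())

    scores = {}
    for pair, count in pair_counts.items():
        p_xy = count / n_pairs
        p_x = char_counts[pair[0]] / n_chars
        p_y = char_counts[pair[1]] / n_chars
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores


def f_score(extracted, gold):
    """Balanced F-score of a set of extracted words against a gold lexicon."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy corpus for illustration only; not data from the paper.
    scores = bigram_pmi(["abab", "cd"])
    candidates = [pair for pair, s in scores.items() if s > 2.0]
    print(sorted(candidates), f_score(candidates, ["ab", "cd"]))
```

In such a baseline, a threshold on the score decides which character pairs are kept as word candidates, and the F-score compares those candidates with a reference lexicon. The threshold of 2.0 in the usage example is arbitrary.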