Extending a thesaurus with words from Pan-Chinese sources

Authors:
Oi Yee Kwong;Benjamin K. Tsou
Affiliations:
City University of Hong Kong, Kowloon, Hong Kong;City University of Hong Kong, Kowloon, Hong Kong
Venue:
COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Year:
2008

Citing 8
Cited 0

Automatic retrieval and clustering of similar words

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 2
Automatic acquisition of hyponyms from large text corpora

COLING '92 Proceedings of the 14th conference on Computational linguistics - Volume 2
Automatic construction of a hypernym-labeled noun hierarchy from text

ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Semantic classification of Chinese unknown words

ACL '03 Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 2
Boosting automatic lexical acquisition with morphological information

ULA '02 Proceedings of the ACL-02 workshop on Unsupervised lexical acquisition - Volume 9
A bootstrapping method for learning semantic lexicons using extraction pattern contexts

EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10
Linguistic preprocessing for distributional classification of words

ElectricDict '04 Proceedings of the Workshop on Enhancing and Using Electronic Dictionaries
Feasibility of enriching a chinese synonym dictionary with a synchronous chinese corpus

FinTAL'06 Proceedings of the 5th international conference on Advances in Natural Language Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this paper, we work on extending a Chinese thesaurus with words distinctly used in various Chinese communities. The acquisition and classification of such region-specific lexical items is an important step toward the larger goal of constructing a Pan-Chinese lexical resource. In particular, we extend a previous study in three respects: (1) to improve automatic classification by removing duplicated words from the thesaurus, (2) to experiment with classifying words at the subclass level and semantic head level, and (3) to further investigate the possible effects of data heterogeneity between the region-specific words and words in the thesaurus on classification performance. Automatic classification was based on the similarity between a target word and individual categories of words in the thesaurus, measured by the cosine function. Experiments were done on 120 target words from four regions. The automatic classification results were evaluated against a gold standard obtained from human judgements. In general accuracy reached 80% or more with the top 10 (out of 80+) and top 100 (out of 1,300+) candidates considered at the subclass level and semantic head level respectively, provided that the appropriate data sources were used.