Categorical ambiguity and information content: a corpus-based study of Chinese

Authors:
Chu-Ren Huang;Ru-Yng Chang
Affiliations:
Institute of Linguistics, Nangkang, Taipei, Taiwan, R.O.C.;Institute of Linguistics, Nangkang, Taipei, Taiwan, R.O.C.
Venue:
SIGHAN '02 Proceedings of the first SIGHAN workshop on Chinese language processing - Volume 18
Year:
2002

Abstract

Assignment of grammatical categories is the fundamental step in natural language processing, and ambiguity resolution is one of the most challenging NLP tasks, one that is still beyond the power of machines. When the two questions are combined, resolution of categorical ambiguity is a problem that a computational linguistic system can handle reasonably well, yet still without matching the performance of human beings. The task is even more challenging in Chinese language processing because of the poverty of morphological information to mark categories and the lack of a convention to mark word boundaries. In this paper, we investigate the nature of categorical ambiguity in Chinese based on the Sinica Corpus. The study differs crucially from previous studies in that it directly measures information content as the degree of ambiguity. This method not only offers an alternative interpretation of ambiguity but also allows a different measure of success for categorical disambiguation: instead of precision or recall, we can measure by how much the information load has been reduced. This approach also allows us to identify which words are the most ambiguous in terms of information content. The somewhat surprising result reinforces the Saussurian view that, underlying the systemic linguistic structure, the assignment of linguistic content to each linguistic symbol is arbitrary.
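
To make the idea of information content as a degree of ambiguity concrete, the following is a minimal sketch, not the authors' procedure. It assumes the measure is Shannon entropy over the relative frequencies of a word's categories, which the abstract does not state. The function names `category_entropy` and `information_reduction`, the tags "N" and "V", and all counts are hypothetical and are not taken from the Sinica Corpus.

```python
import math
from collections import Counter


def category_entropy(tag_counts):
    """Shannon entropy (in bits) of a word's category distribution.

    tag_counts maps a category label to how often the word occurs
    with that category. Assumption: entropy stands in for the
    'information content' mentioned in the abstract.
    """
    total = sum(tag_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in tag_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def information_reduction(before, after):
    """Drop in entropy (bits) between two category distributions."""
    return category_entropy(before) - category_entropy(after)


# Hypothetical category counts for three words (illustrative only).
balanced = Counter({"N": 50, "V": 50})  # evenly split between two categories
skewed = Counter({"N": 99, "V": 1})     # two categories, one dominant
resolved = Counter({"N": 100})          # a single category, no ambiguity

for name, counts in [("balanced", balanced), ("skewed", skewed), ("resolved", resolved)]:
    print(name, round(category_entropy(counts), 4))
```

Under this assumption, two words with the same number of possible categories can differ greatly in ambiguity: the evenly split word carries one full bit, while the heavily skewed word carries far less. Calling `information_reduction(balanced, resolved)` illustrates how disambiguation success could be scored by the amount of information load removed instead of by precision or recall.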