Statistical Part-of-Speech Tagging for Classical Chinese

Authors:
Liang Huang; Yinan Peng; Huan Wang; Zhenyu Wu
Venue:
TSD '02 Proceedings of the 5th International Conference on Text, Speech and Dialogue
Year:
2002

Abstract

Classical Chinese differs fundamentally from Modern Chinese in both syntax and morphology. While there have recently been a number of studies on part-of-speech (PoS) tagging for Modern Chinese, PoS tagging for Classical Chinese has been largely neglected. To the best of our knowledge, this is the first work in the area. Fortunately, in terms of tagging, Classical Chinese is easier than Modern Chinese: most Classical Chinese words consist of a single character, so no segmentation is needed. In this paper, we propose and analyze a simple statistical approach to PoS tagging of Classical Chinese. We first design a tagset for Classical Chinese that is later shown to be accurate and efficient. We then apply the Viterbi algorithm for hidden Markov models (HMMs) and make several improvements, such as handling of the sparse data problem and guessing of unknown words, both designed specifically for Classical Chinese. As the training set grows larger, the accuracies of the bigram and trigram models increase to 94.9% and 97.6%, respectively. Our work also contributes by identifying and solving some previously unseen problems in processing Classical Chinese.
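
To make the approach concrete, the following is a minimal sketch of a bigram HMM tagger decoded with the Viterbi algorithm, treating each character as one word as the abstract describes. It is not the authors' implementation. The tag names (n, v, conj, pron), the toy corpus, and the function names are hypothetical. Add-one smoothing stands in for the paper's sparse data handling, and giving every unseen character the same smoothed count stands in for its unknown word guessing; the abstract specifies neither method. The trigram model reported in the abstract is not shown.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical sentence-start pseudo-tag

def train(tagged_sentences):
    """Count tag-bigram transitions and tag-to-character emissions."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for char, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][char] += 1
            vocab.add(char)
            prev = tag
    return trans, emit, sorted(emit), vocab

def log_prob(counts, context, event, n_outcomes):
    """Add-one smoothed log P(event | context); an assumed stand-in
    for the paper's sparse data handling."""
    row = counts.get(context, {})
    total = sum(row.values())
    return math.log((row.get(event, 0) + 1) / (total + n_outcomes))

def viterbi(chars, model):
    """Return the most probable tag sequence under the bigram HMM."""
    trans, emit, tags, vocab = model
    if not chars:
        return []
    n_tags = len(tags)
    n_chars = len(vocab) + 1  # one extra outcome reserved for unknown characters
    # best[t] = (log-probability of the best path ending in tag t, that path)
    best = {
        t: (log_prob(trans, START, t, n_tags)
            + log_prob(emit, t, chars[0], n_chars), [t])
        for t in tags
    }
    for ch in chars[1:]:
        new_best = {}
        for t in tags:
            # Pick the predecessor tag that maximizes the path score into t.
            score, path = max(
                ((best[p][0] + log_prob(trans, p, t, n_tags), best[p][1])
                 for p in tags),
                key=lambda item: item[0],
            )
            new_best[t] = (score + log_prob(emit, t, ch, n_chars), path + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Toy training data with hypothetical tags, for illustration only.
corpus = [
    [("子", "n"), ("曰", "v")],
    [("學", "v"), ("而", "conj"), ("習", "v"), ("之", "pron")],
    [("人", "n"), ("學", "v"), ("之", "pron")],
]
model = train(corpus)
print(viterbi(["子", "學", "之"], model))
```

Scores are kept in log space so that long sentences do not underflow. A character absent from the training data, such as 王 in this toy setup, still receives a nonzero emission probability under every tag, so the transition probabilities decide its tag.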