Statistical Part-of-Speech Tagging for Classical Chinese

  • Authors:
  • Liang Huang; Yinan Peng; Huan Wang; Zhenyu Wu

  • Venue:
  • TSD '02 Proceedings of the 5th International Conference on Text, Speech and Dialogue
  • Year:
  • 2002

Abstract

Classical Chinese differs fundamentally from Modern Chinese in both syntax and morphology. While there have recently been a number of works on part-of-speech (PoS) tagging for Modern Chinese, PoS tagging for Classical Chinese has been largely neglected. To the best of our knowledge, this is the first work in the area. Fortunately, however, Classical Chinese is easier to tag than Modern Chinese in that most Classical Chinese words consist of a single character, so no segmentation is needed. In this paper, we propose and analyze a simple statistical approach to PoS tagging of Classical Chinese. We first design a tagset for Classical Chinese that is later shown to be accurate and efficient. We then apply the hidden Markov model (HMM) with the Viterbi algorithm and make several improvements, such as sparse-data handling and unknown-word guessing, both designed specifically for Classical Chinese. As the training set grows larger, the accuracies of the bigram and trigram models increase to 94.9% and 97.6%, respectively. The contribution of our work also lies in posing and solving some previously unseen problems in processing Classical Chinese.
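
To make the bigram HMM with Viterbi decoding concrete, the following is a minimal sketch in Python in which each character is treated as one word, so no segmentation step is needed. The tag names (n, v, adv, conj, pron), the three-sentence toy corpus, the function names, and the use of add-one smoothing are all assumptions made for illustration. They are not the tagset, the sparse-data handling, or the unknown-word guessing described in the abstract.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical sentence-start pseudo-tag

def train(tagged_sentences):
    """Count tag bigrams and (tag, word) pairs from lists of (word, tag)."""
    trans, emit, tag_count = Counter(), Counter(), Counter()
    vocab = set()
    for sent in tagged_sentences:
        prev = START
        tag_count[prev] += 1
        for word, tag in sent:
            trans[(prev, tag)] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(t for t in tag_count if t != START)
    return trans, emit, tag_count, tags, vocab

def viterbi(words, model):
    """Return the most probable tag sequence under a bigram HMM."""
    if not words:
        return []
    trans, emit, tag_count, tags, vocab = model
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot so unseen words get some mass

    # Add-one smoothing: an assumption standing in for the paper's methods.
    def log_trans(prev, tag):
        return math.log((trans[(prev, tag)] + 1) / (tag_count[prev] + n_tags))

    def log_emit(tag, word):
        return math.log((emit[(tag, word)] + 1) / (tag_count[tag] + n_words))

    # best[t] is the log-probability of the best path ending in tag t.
    best = {t: log_trans(START, t) + log_emit(t, words[0]) for t in tags}
    backpointers = []
    for word in words[1:]:
        new_best, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] + log_trans(p, t))
            new_best[t] = best[prev] + log_trans(prev, t) + log_emit(t, word)
            pointers[t] = prev
        best = new_best
        backpointers.append(pointers)

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy corpus with hypothetical tags; one character per word.
corpus = [
    [("子", "n"), ("曰", "v")],
    [("學", "v"), ("而", "conj"), ("時", "adv"), ("習", "v"), ("之", "pron")],
    [("人", "n"), ("不", "adv"), ("知", "v")],
]
model = train(corpus)
print(viterbi(["人", "不", "學"], model))
```

A trigram variant would condition each tag on the two preceding tags instead of one, which enlarges the transition table and makes smoothing more important on small training sets.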