A classical Chinese corpus with nested part-of-speech tags

Authors: John Lee
Affiliations: City University of Hong Kong
Venue: LaTeCH '12 Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
Year: 2012

Abstract

We introduce a corpus of classical Chinese poems that has been word-segmented and tagged with parts of speech (POS). Because the concept of a 'word' in Chinese is ill-defined, previous Chinese corpora lack a standard for word segmentation. This leads to inconsistent POS tags and hinders interoperability among corpora. We address this problem with nested POS tags, which accommodate different theories of wordhood and support research objectives that require annotation of the 'word' at different levels of granularity.
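
To make the idea of nested POS tags concrete, the following is a minimal Python sketch, not the corpus's actual format. The `Node` class, the `words` helper, the example line, and the choice of tag labels (NR, JJ, NN, VV, borrowed from the Penn Chinese Treebank tagset) are all assumptions made for illustration. It shows how one nested annotation could yield word/tag sequences at two levels of granularity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A tagged unit; either a leaf with text or a word built from smaller units."""
    tag: str
    text: str = ""                                   # set only on leaves
    children: List["Node"] = field(default_factory=list)

    def surface(self) -> str:
        """Return the characters covered by this node."""
        if not self.children:
            return self.text
        return "".join(child.surface() for child in self.children)


def words(nodes: List[Node], max_depth: int) -> List[Tuple[str, str]]:
    """Flatten to (word, tag) pairs, opening at most max_depth levels of nesting."""
    out: List[Tuple[str, str]] = []
    for node in nodes:
        if node.children and max_depth > 0:
            out.extend(words(node.children, max_depth - 1))
        else:
            out.append((node.surface(), node.tag))
    return out


# Hypothetical annotation of the line 黃河入海流; the analysis is illustrative only.
line = [
    Node("NR", children=[Node("JJ", "黃"), Node("NN", "河")]),  # 黃河 as one proper noun
    Node("VV", "入"),
    Node("NN", "海"),
    Node("VV", "流"),
]

coarse = words(line, max_depth=0)  # treats 黃河 as a single word
fine = words(line, max_depth=1)    # splits 黃河 into its component units
print(coarse)
print(fine)
```

Under this sketch, a user who regards 黃河 as one word reads the outer tag, while a user who prefers finer segmentation reads the inner tags, and both views come from the same annotation.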