A stochastic finite-state word-segmentation algorithm for Chinese
Computational Linguistics
Statistical Part-of-Speech Tagging for Classical Chinese
TSD '02 Proceedings of the 5th International Conference on Text, Speech and Dialogue
Multiword Expressions: A Pain in the Neck for NLP
CICLing '02 Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing
Building a large annotated corpus of English: the penn treebank
Computational Linguistics - Special issue on using large corpora: II
The Penn Chinese TreeBank: Phrase structure annotation of a large corpus
Natural Language Engineering
The first international Chinese word segmentation Bakeoff
SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
Optimizing Chinese word segmentation for machine translation performance
StatMT '08 Proceedings of the Third Workshop on Statistical Machine Translation
A dependency treebank of classical Chinese poems
NAACL HLT '12 Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Hi-index | 0.00 |
We introduce a corpus of classical Chinese poems that has been word segmented and tagged with parts-of-speech (POS). Due to the ill-defined concept of a 'word' in Chinese, previous Chinese corpora suffer from a lack of standardization in word segmentation, resulting in inconsistencies in POS tags, therefore hindering interoperability among corpora. We address this problem with nested POS tags, which accommodates different theories of wordhood and facilitates research objectives requiring annotations of the 'word' at different levels of granularity.