We address the issue of consuming heterogeneous annotation data for Chinese word segmentation and part-of-speech tagging. We empirically analyze the diversity between two representative corpora, namely the Penn Chinese Treebank (CTB) and PKU's People's Daily (PPD), on manually mapped data, and show that their linguistic annotations are systematically different yet highly compatible. The analysis is further exploited to improve processing accuracy by (1) integrating systems that are respectively trained on heterogeneous annotations to reduce the approximation error, and (2) re-training models with high-quality automatically converted data to reduce the estimation error. Evaluation on the CTB and PPD data shows that our novel model achieves a relative error reduction of 11% over the best reported result in the literature.