Automatic Treebank Conversion via Informed Decoding - A Case Study on Chinese Treebanks

Authors:
Muhua Zhu; Jingbo Zhu; Tong Xiao
Affiliations:
Northeastern University, China (all authors)
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2011

Abstract

Treebanks are valuable resources for syntactic parsing. For some languages, such as Chinese, multiple constituency treebanks developed by different organizations are available. However, because their underlying annotation standards differ, such treebanks generally cannot be used together by simply combining the data. To enlarge the training data for syntactic parsing, this article addresses the challenge of unifying the standards of disparate treebanks by automatically converting one treebank (the source treebank) to fit the standard exhibited by another (the target treebank). We propose to convert a treebank in two sequential steps, corresponding to the part-of-speech (POS) level and the syntactic structure level (including tree structures and grammar labels), respectively. The approaches used at both levels can be unified as an informed decoding procedure, in which information derived from the original annotation in the source treebank guides the conversion performed by a POS tagger (or, at the syntactic structure level, a parser) trained on the target treebank. We take two Chinese treebanks as a case study; experiments on them show significant improvements in conversion accuracy over baseline systems, especially when the target treebank is small.
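
The abstract does not say how the source annotation enters the decoder, so the following is only a minimal sketch of the general idea at the POS level, not the authors' method. It assumes a first-order Viterbi tagger whose per-token score is augmented with a log-score linking each source tag to each target tag. The function name `informed_viterbi`, the tables `emit`, `trans`, and `compat`, the `weight` parameter, the tag names, and all toy numbers are hypothetical.

```python
import math

FLOOR = -20.0  # log-score assigned to unseen events (an arbitrary assumption)


def informed_viterbi(words, source_tags, tags, emit, trans, compat, weight=1.0):
    """Viterbi decoding over target tags, guided by source-treebank tags.

    emit[(tag, word)] and trans[(prev_tag, tag)] are log-scores from a
    model trained on the target treebank. compat[(source_tag, tag)] is a
    log-score for mapping a source tag to a target tag. All tables are
    hypothetical inputs; weight=0 gives plain, uninformed decoding.
    """
    if not words:
        return []

    def local(i, tag):
        # Target-model emission score plus the source-annotation bonus.
        return (emit.get((tag, words[i]), FLOOR)
                + weight * compat.get((source_tags[i], tag), FLOOR))

    # best[i][tag] = (score of best path ending in tag at i, previous tag)
    best = [{t: (trans.get(("<s>", t), FLOOR) + local(0, t), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p][0] + trans.get((p, t), FLOOR))
            score = best[i - 1][prev][0] + trans.get((prev, t), FLOOR) + local(i, t)
            column[t] = (score, prev)
        best.append(column)

    # Follow back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Toy data: the target model slightly prefers NN for both words,
# while the source treebank tags the first word as a verb ("v").
TAGS = ["NN", "VV"]
EMIT = {("NN", "w1"): math.log(0.6), ("VV", "w1"): math.log(0.4),
        ("NN", "w2"): math.log(0.6), ("VV", "w2"): math.log(0.4)}
TRANS = {}  # left empty so every transition gets the same floor score
COMPAT = {("v", "VV"): math.log(0.9), ("v", "NN"): math.log(0.1),
          ("n", "NN"): math.log(0.9), ("n", "VV"): math.log(0.1)}

plain = informed_viterbi(["w1", "w2"], ["v", "n"], TAGS, EMIT, TRANS, COMPAT, weight=0.0)
informed = informed_viterbi(["w1", "w2"], ["v", "n"], TAGS, EMIT, TRANS, COMPAT, weight=1.0)
```

In this toy setting the uninformed run follows the target model alone, while the informed run lets the source tag override a weak target-side preference. A parser at the syntactic structure level could in principle be guided in a similar way, but that is not sketched here.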