The Penn Chinese TreeBank: Phrase structure annotation of a large corpus

Authors:
Naiwen Xue; Fei Xia; Fu-dong Chiou; Martha Palmer
Affiliations:
University of Pennsylvania, Philadelphia, PA 19104, USA e-mail: xueniwen@linc.cis.upenn.edu, fxia@linc.cis.upenn.edu, chioufd@linc.cis.upenn.edu, mpalmer@linc.cis.upenn.edu
Venue:
Natural Language Engineering
Year:
2005

Abstract

With growing interest in Chinese Language Processing, numerous NLP tools (e.g., word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on corpora with different segmentation criteria, part-of-speech tagsets and bracketing guidelines, and therefore, comparisons are difficult. As a first step towards addressing this issue, we have been preparing a large bracketed corpus since late 1998. The first two installments of the corpus, 250 thousand words of data, fully segmented, POS-tagged and syntactically bracketed, have been released to the public via LDC (www.ldc.upenn.edu). In this paper, we discuss several Chinese linguistic issues and their implications for our treebanking efforts and how we address these issues when developing our annotation guidelines. We also describe our engineering strategies to improve speed while ensuring annotation quality.