Evaluating a statistical CCG parser on Wikipedia

  • Authors:
  • Matthew Honnibal; Joel Nothman; James R. Curran

  • Affiliations:
  • University of Sydney, NSW, Australia (all authors)

  • Venue:
  • People's Web '09: Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources
  • Year:
  • 2009

Abstract

The vast majority of parser evaluation is conducted on the 1989 Wall Street Journal corpus (WSJ). In-domain evaluation of this kind is important for system development, but it gives little indication of how the parser will perform on many practical problems. Wikipedia is an interesting domain for parsing that has so far been under-explored. We present statistical parsing results that, for the first time, indicate what performance a user parsing Wikipedia text can expect. We find that the C&C parser's standard model is 4.3% less accurate on Wikipedia text, but that a simple self-training exercise reduces the gap to 3.8%. The self-training also speeds up the parser on newswire text by 20%.
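The self-training exercise the abstract mentions can be sketched as a generic loop: parse unlabelled target-domain text with the base model, keep only the parses the model is confident about, and retrain on the union of the gold treebank and those automatic parses. The code below is a toy illustration with stand-in `parse`, `confidence`, and `train` functions; the paper's actual experiment uses the C&C CCG parser and its own training pipeline, not this code.

```python
# Toy self-training sketch (hypothetical helper names; not the C&C toolchain).

def self_train(labelled, unlabelled, parse, confidence, train, threshold=0.9):
    """One self-training round: auto-parse target-domain sentences,
    keep high-confidence parses, and retrain on gold + automatic data."""
    auto = [(sent, parse(sent)) for sent in unlabelled]
    kept = [(sent, tree) for sent, tree in auto
            if confidence(sent, tree) >= threshold]
    return train(labelled + kept), kept

# Stand-ins so the sketch runs end to end (not real parsing):
parse = lambda sent: sent.split()                 # "parse" = token list
confidence = lambda sent, tree: 1.0 / len(tree)   # shorter "parses" score higher
train = lambda data: {"size": len(data)}          # "model" = training-set size

model, kept = self_train(
    labelled=[("gold sentence", ["gold", "sentence"])],
    unlabelled=["short one", "a much longer wikipedia sentence here"],
    parse=parse, confidence=confidence, train=train, threshold=0.5,
)
```

Here only the short sentence clears the confidence threshold, so the retrained "model" sees two examples: the gold pair plus one automatic parse. In the real setting, the filtering step matters because low-confidence automatic parses would otherwise feed the parser's own errors back into training.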