Evaluating and integrating treebank parsers on a biomedical corpus

Authors:
Andrew B. Clegg; Adrian J. Shepherd
Affiliations:
University of London, London, UK; University of London, London, UK
Venue:
Software '05 Proceedings of the Workshop on Software
Year:
2005

Abstract

It is not clear a priori how well parsers trained on the Penn Treebank will parse significantly different corpora without retraining. We carried out a competitive evaluation of three leading treebank parsers on an annotated corpus from the human molecular biology domain and, for comparison, on an extract from the Penn Treebank. We performed a detailed analysis of the kinds of errors each parser made, along with a quantitative comparison of syntax usage between the two corpora. Our results suggest that these tools are becoming somewhat over-specialised on their training domain at the expense of portability, but also indicate that some of the errors encountered are of doubtful importance for information extraction tasks. Furthermore, our initial experiments with unsupervised parse combination techniques showed that integrating the output of several parsers can ameliorate some of the performance problems they encounter on unfamiliar text, providing accuracy and coverage improvements, and a novel measure of trustworthiness. Supplementary materials are available at http://textmining.cryst.bbk.ac.uk/ac105/.
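
The abstract does not say which combination technique was used, so the following is only an illustrative sketch of one common unsupervised approach: majority voting over labelled constituent spans. The function name `combine_by_voting`, the `(label, start, end)` span encoding, and the agreement score are all assumptions made for this example, not the authors' method or their trustworthiness measure.

```python
from collections import Counter
from typing import List, Set, Tuple

# Assumed encoding: a constituent is (label, start, end) over token
# positions, with `end` exclusive. Each parser's output is a set of these.
Constituent = Tuple[str, int, int]


def combine_by_voting(
    parses: List[Set[Constituent]],
) -> Tuple[Set[Constituent], float]:
    """Keep constituents proposed by a strict majority of parsers.

    Returns the combined constituent set and a hypothetical agreement
    score: the mean fraction of parsers supporting each kept constituent.
    """
    if not parses:
        raise ValueError("need at least one parse")
    # Each parse is a set, so a parser casts at most one vote per span.
    votes = Counter(c for parse in parses for c in parse)
    threshold = len(parses) / 2
    kept = {c for c, n in votes.items() if n > threshold}
    if not kept:
        return kept, 0.0
    agreement = sum(votes[c] for c in kept) / (len(kept) * len(parses))
    return kept, agreement


# Toy outputs from three hypothetical parsers for the four-token
# sentence "the protein binds DNA".
parser_a = {("S", 0, 4), ("NP", 0, 2), ("VP", 2, 4), ("NP", 3, 4)}
parser_b = {("S", 0, 4), ("NP", 0, 2), ("VP", 2, 4)}
parser_c = {("S", 0, 4), ("NP", 0, 3), ("VP", 2, 4), ("NP", 3, 4)}

combined, agreement = combine_by_voting([parser_a, parser_b, parser_c])
print(sorted(combined), round(agreement, 3))
```

In this toy case the span `("NP", 0, 3)` is proposed by only one parser and is dropped. A strict majority threshold also guarantees that the kept brackets never cross, provided each input parse is itself a well-formed tree: any two kept spans must share at least one parser that proposed both. The agreement value is one simple way to flag sentences on which the parsers disagree.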