We present the first evaluation of the utility of automatic evaluation metrics on surface realizations of Penn Treebank data. Using outputs of the OpenCCG and XLE realizers, along with ranked WordNet synonym substitutions, we collected a corpus of generated surface realizations, which were then rated and post-edited by human annotators. We evaluated the realizations using seven automatic metrics and analyzed the correlations between the human judgments and the automatic scores. In contrast to previous NLG meta-evaluations, we find that several of the metrics correlate moderately well with human judgments of both adequacy and fluency, with the TER family performing best overall. We also find that all of the metrics correctly predict more than half of the significant system-level differences, though none are correct in all cases. We conclude with a discussion of the implications for the utility of such metrics in evaluating generation in the presence of variation. A further outcome of our research is a corpus of post-edited realizations, which will be made available to the research community.
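To make the correlation analysis described above concrete, the following is a minimal sketch (not the authors' code) of how per-realization metric scores can be correlated with human judgments using Spearman's rank correlation from SciPy. The ratings and metric scores shown are invented placeholders; note that for an edit-distance metric such as TER, lower scores indicate better output, so a negative correlation with human judgments is the expected direction.

```python
# Hedged illustration: correlating hypothetical human adequacy ratings with
# automatic metric scores for the same realizations via Spearman's rho.
from scipy.stats import spearmanr

# Hypothetical per-realization scores; in the study these would come from
# human annotators and from metrics such as BLEU or TER variants.
human_adequacy = [4.5, 3.0, 5.0, 2.5, 4.0]
metric_scores = {
    "BLEU": [0.62, 0.41, 0.78, 0.35, 0.55],
    "TER":  [0.21, 0.48, 0.10, 0.60, 0.30],  # lower is better for TER
}

for name, scores in metric_scores.items():
    rho, p_value = spearmanr(human_adequacy, scores)
    print(f"{name}: Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```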