We present the first evaluation of the utility of automatic evaluation metrics on surface realizations of Penn Treebank data. Using outputs of the OpenCCG and XLE realizers, along with ranked WordNet synonym substitutions, we collected a corpus of generated surface realizations, which were then rated and post-edited by human annotators. We evaluated the realizations using seven automatic metrics and analyzed the correlations between the human judgments and the automatic scores. In contrast to previous NLG meta-evaluations, we find that several of the metrics correlate moderately well with human judgments of both adequacy and fluency, with the TER family performing best overall. We also find that all of the metrics correctly predict more than half of the significant system-level differences, though none are correct in all cases. We conclude with a discussion of the implications for the utility of such metrics in evaluating generation in the presence of variation. A further outcome of our research is a corpus of post-edited realizations, which will be made available to the research community.
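To make the correlation analysis described above concrete, the following is a minimal sketch (not the authors' code) of how per-realization metric scores can be correlated with human judgments using Spearman's rank correlation from SciPy. The ratings and metric scores shown are invented placeholders; note that for an edit-distance metric such as TER, lower scores indicate better output, so a negative correlation with human judgments is the expected direction.

```python
# Hedged illustration: correlating hypothetical human adequacy ratings with
# automatic metric scores for the same realizations via Spearman's rho.
from scipy.stats import spearmanr

# Hypothetical per-realization scores; in the study these would come from
# human annotators and from metrics such as BLEU or TER variants.
human_adequacy = [4.5, 3.0, 5.0, 2.5, 4.0]
metric_scores = {
    "BLEU": [0.62, 0.41, 0.78, 0.35, 0.55],
    "TER":  [0.21, 0.48, 0.10, 0.60, 0.30],  # lower is better for TER
}

for name, scores in metric_scores.items():
    rho, p_value = spearmanr(human_adequacy, scores)
    print(f"{name}: Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```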