Correlating human and automatic evaluation of a German surface realiser

Authors:
Aoife Cahill
Affiliations:
University of Stuttgart, Germany
Venue:
ACLShort '09 Proceedings of the ACL-IJCNLP 2009 Conference Short Papers
Year:
2009

Abstract

We examine correlations between native speaker judgements of automatically generated German text and automatic evaluation metrics. We look at a number of metrics from the MT and summarisation communities and find that, for a relative ranking task, most automatic metrics perform equally well and correlate fairly strongly with the human judgements. In contrast, on a naturalness judgement task, the General Text Matcher (GTM) tool correlates best overall, although in general the correlation between the human judgements and the automatic metrics was quite weak.
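
To illustrate the kind of analysis described in the abstract, the following minimal sketch correlates per-sentence human judgements with the scores of one automatic metric. The abstract does not say which correlation coefficient was used; Spearman's rank correlation is assumed here purely for illustration. The function names and all data values are hypothetical and are not taken from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: one human naturalness score and one automatic metric
# score per generated sentence (values invented for this example).
human_scores = [4.5, 3.0, 2.0, 5.0, 1.5]
metric_scores = [0.62, 0.48, 0.51, 0.80, 0.20]

rho = spearman(human_scores, metric_scores)
print(f"Spearman rho = {rho:.2f}")
```

A value of rho near 1 would indicate that the metric orders the sentences much as the human judges do, while a value near 0 would indicate little agreement, which is the sense in which the abstract calls correlations "fairly strong" or "quite weak".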