We examine the correlation between native-speaker judgements of automatically generated German text and automatic evaluation metrics. We consider a number of metrics from the machine translation (MT) and summarisation communities and find that, on a relative ranking task, most automatic metrics perform equally well and correlate fairly strongly with the human judgements. On a naturalness judgement task, by contrast, the General Text Matcher (GTM) tool correlates best overall, although correlation between the human judgements and the automatic metrics was in general quite weak.
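As an illustration of what correlating human judgements with an automatic metric can involve, the following is a minimal sketch that computes a Spearman rank correlation between two score lists. The abstract does not say which correlation coefficient was used, so the choice of Spearman is an assumption, as are the function names and the toy scores, which are made up for illustration and are not results from the study.

```python
from statistics import mean


def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of items tied with the item at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    return pearson(ranks(xs), ranks(ys))


# Hypothetical toy data: one human score and one metric score per sentence.
human_scores = [1, 2, 3, 4, 5]
metric_scores = [0.10, 0.35, 0.30, 0.60, 0.80]

rho = spearman(human_scores, metric_scores)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based coefficient is a natural fit when the human data are ordinal, as in a relative ranking task, because it depends only on the ordering of the scores and not on their scale.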