In this chapter, we present a human-based evaluation of surface realisation alternatives. We examine the relative rankings of naturally occurring corpus sentences and of automatically generated strings chosen by statistical models (a language model and a log-linear model), as well as the naturalness of the strings chosen by the log-linear model. We also investigate to what extent the preceding context affects the choice of realisation. We show that native speakers accept a considerable amount of variation in word order, but that certain factors clearly make some realisation alternatives more natural than others. We then examine correlations between native speaker judgements of automatically generated German text and automatic evaluation metrics. We consider a number of metrics from the machine translation (MT) and summarisation communities and find that, on a relative ranking task, most automatic metrics perform equally well and correlate fairly strongly with the human judgements. In contrast, on a naturalness judgement task, the correlation between the human judgements and the automatic metrics is quite weak, with the General Text Matcher (GTM) tool providing the only metric that correlates with the human judgements at a statistically significant level.
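To make the kind of analysis described above concrete, the sketch below shows one plausible way to compute metric-judgement correlations; it is not the chapter's actual pipeline. The score lists are illustrative placeholders: in practice, human_judgements would hold per-sentence naturalness ratings or ranks, and metric_scores would hold the corresponding outputs of an automatic metric such as GTM or BLEU for the same realisations.

```python
# A minimal sketch, assuming per-sentence human judgements and automatic
# metric scores are already available for the same set of realisations.
from scipy.stats import pearsonr, spearmanr

# Hypothetical placeholder data (not from the chapter's experiments).
human_judgements = [4.5, 3.0, 4.0, 2.5, 5.0, 3.5]        # e.g. mean naturalness ratings
metric_scores    = [0.82, 0.55, 0.71, 0.40, 0.90, 0.60]  # e.g. GTM or BLEU scores

# Spearman's rho compares rankings, which suits the relative-ranking task;
# Pearson's r measures linear agreement with the raw judgement scale.
rho, rho_p = spearmanr(human_judgements, metric_scores)
r, r_p = pearsonr(human_judgements, metric_scores)

print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
print(f"Pearson  r   = {r:.3f} (p = {r_p:.3f})")
```

The reported p-values correspond to the statistical significance test mentioned for the GTM result: a metric would count as correlating significantly with the human judgements when p falls below the chosen threshold (conventionally 0.05).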