Human evaluation of a German surface realisation ranker

  • Authors:
  • Aoife Cahill; Martin Forst

  • Affiliations:
  • Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, Stuttgart, Germany; Powerset, Microsoft, San Francisco, CA

  • Venue:
  • Empirical Methods in Natural Language Generation
  • Year:
  • 2010


Abstract

In this chapter we present a human-based evaluation of surface realisation alternatives. We examine the relative rankings of naturally occurring corpus sentences and automatically generated strings chosen by statistical models (a language model and a log-linear model), as well as the naturalness of the strings chosen by the log-linear model. We also investigate to what extent preceding context has an effect on choice. We show that native speakers accept considerable variation in word order, but that there are clearly also factors that make certain realisation alternatives more natural than others. We then examine correlations between native speaker judgements of automatically generated German text and automatic evaluation metrics. We consider a number of metrics from the MT and summarisation communities and find that, for a relative ranking task, most automatic metrics perform equally well and correlate fairly strongly with the human judgements. In contrast, on a naturalness judgement task, the correlation between the human judgements and the automatic metrics was quite weak, with the General Text Matcher (GTM) tool providing the only metric that correlates with the human judgements at a statistically significant level.
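
The kind of metric-versus-human correlation analysis described in the abstract can be illustrated with a short sketch. The snippet below is a minimal illustration, not the authors' code: it correlates per-sentence human judgements with automatic metric scores using Spearman's rank correlation, which is a standard choice for ranking-style comparisons. The score values are invented placeholders, not data from the study.

```python
# Minimal sketch: correlating human judgements with an automatic metric.
# All numbers below are hypothetical, for illustration only.
from scipy.stats import spearmanr

# Hypothetical per-sentence scores: human naturalness ratings (e.g. on a
# 1-5 scale) and scores from an automatic metric such as GTM (0-1 range).
human_judgements = [4.5, 3.0, 2.0, 5.0, 3.5]
metric_scores = [0.91, 0.64, 0.55, 0.88, 0.72]

# Spearman's rho compares the *rankings* induced by the two score lists,
# so it does not assume the scales are linearly related.
rho, p_value = spearmanr(human_judgements, metric_scores)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")

# A positive rho with a small p-value would indicate that the metric
# tracks the human judgements, as the chapter reports for GTM on the
# naturalness judgement task.
```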