Human evaluation of a German surface realisation ranker

  • Authors:
  • Aoife Cahill; Martin Forst

  • Affiliations:
  • Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, Stuttgart, Germany; Powerset, Microsoft, San Francisco, CA

  • Venue:
  • Empirical Methods in Natural Language Generation
  • Year:
  • 2010


Abstract

In this chapter we present a human-based evaluation of surface realisation alternatives. We examine the relative rankings of naturally occurring corpus sentences and automatically generated strings chosen by statistical models (a language model and a log-linear model), as well as the naturalness of the strings chosen by the log-linear model. We also investigate to what extent preceding context has an effect on choice. We show that native speakers accept considerable variation in word order, but that there are clearly also factors that make certain realisation alternatives more natural than others. We then examine correlations between native speaker judgements of automatically generated German text and automatic evaluation metrics. We consider a number of metrics from the MT and summarisation communities and find that, for a relative ranking task, most automatic metrics perform equally well and correlate fairly strongly with the human judgements. In contrast, on a naturalness judgement task, the correlation between the human judgements and the automatic metrics was quite weak, with the General Text Matcher (GTM) tool providing the only metric that correlates with the human judgements at a statistically significant level.
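
The kind of metric-versus-human correlation analysis described in the abstract can be illustrated with a short sketch. The snippet below is a minimal illustration, not the authors' code: it correlates per-sentence human judgements with automatic metric scores using Spearman's rank correlation, which is a standard choice for ranking-style comparisons. The score values are invented placeholders, not data from the study.

```python
# Minimal sketch: correlating human judgements with an automatic metric.
# All numbers below are hypothetical, for illustration only.
from scipy.stats import spearmanr

# Hypothetical per-sentence scores: human naturalness ratings (e.g. on a
# 1-5 scale) and scores from an automatic metric such as GTM (0-1 range).
human_judgements = [4.5, 3.0, 2.0, 5.0, 3.5]
metric_scores = [0.91, 0.64, 0.55, 0.88, 0.72]

# Spearman's rho compares the *rankings* induced by the two score lists,
# so it does not assume the scales are linearly related.
rho, p_value = spearmanr(human_judgements, metric_scores)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")

# A positive rho with a small p-value would indicate that the metric
# tracks the human judgements, as the chapter reports for GTM on the
# naturalness judgement task.
```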