Intrinsic vs. extrinsic evaluation measures for referring expression generation

Authors:
Anja Belz; Albert Gatt
Affiliations:
University of Brighton, Brighton, UK; University of Aberdeen, Aberdeen, UK
Venue:
HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
Year:
2008

Abstract

We apply two kinds of evaluation to 15 systems implementing a referring expression generation task: (i) intrinsic evaluation metrics of the kind characteristic of current comparative human language technology (HLT) evaluation, and (ii) extrinsic, human task-performance evaluations more in keeping with natural language generation (NLG) traditions. Analysing the results, we find no significant correlations between the intrinsic and extrinsic evaluation measures for this task.
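
As a rough illustration of the kind of analysis the abstract describes, the sketch below correlates one intrinsic score with one extrinsic score across a handful of systems. The abstract does not say which correlation coefficient was used, so Pearson's r is an assumption here. The function name `pearson_r` and all score values are hypothetical placeholders, not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-system scores (placeholders, NOT data from the paper).
# One entry per system; the paper evaluates 15 systems, five are shown here.
intrinsic = [0.61, 0.55, 0.70, 0.48, 0.66]  # e.g. similarity to human-written references
extrinsic = [2.9, 2.4, 3.1, 2.8, 2.5]       # e.g. mean task completion time in seconds

r = pearson_r(intrinsic, extrinsic)
print(f"Pearson r = {r:.3f}")
```

A coefficient near zero across systems would be in line with the reported finding. Judging whether a coefficient is significant needs a separate test, which this sketch leaves out.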