Evaluating evaluation methods for generation in the presence of variation

Authors:
Amanda Stent; Matthew Marge; Mohit Singhai
Affiliations:
Stony Brook University, Stony Brook, NY; Stony Brook University, Stony Brook, NY; Stony Brook University, Stony Brook, NY
Venue:
CICLing'05 Proceedings of the 6th international conference on Computational Linguistics and Intelligent Text Processing
Year:
2005

Citing 11

Controlling Content Realization with Functional Unification Grammars
Proceedings of the 6th International Workshop on Natural Language Generation: Aspects of Automated Natural Language Generation

Forest-based statistical sentence generation
NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference

Paraphrasing using given and new information in a question-answer system
ACL '79 Proceedings of the 17th annual meeting on Association for Computational Linguistics

Exploiting a probabilistic hierarchical model for generation
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1

Extracting paraphrases from a parallel corpus
ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics

Learning to paraphrase: an unsupervised approach using multiple-sequence alignment
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1

Syntax-based alignment of multiple translations: extracting paraphrases and generating new sentences
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1

Evaluation metrics for generation
INLG '00 Proceedings of the first international conference on Natural language generation - Volume 14

Extracting structural paraphrases from aligned monolingual corpora
PARAPHRASE '03 Proceedings of the second international workshop on Paraphrasing - Volume 16

Automatic paraphrase acquisition from news articles
HLT '02 Proceedings of the second international conference on Human Language Technology Research

Evaluating coverage for large symbolic NLG grammars
IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence

Cited 13

Paraphrasing for automatic evaluation
HLT-NAACL '06 Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics

The software architecture for the first challenge on generating instructions in virtual environments
EACL '09 Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics: Demonstrations Session

Avoiding repetition in generated text
ENLG '07 Proceedings of the Eleventh European Workshop on Natural Language Generation

Report on the first NLG Challenge on Generating Instructions in Virtual Environments (GIVE)
ENLG '09 Proceedings of the 12th European Workshop on Natural Language Generation

Correlating human and automatic evaluation of a German surface realiser
ACLShort '09 Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

An investigation into the validity of some metrics for automatically evaluating natural language generation systems
Computational Linguistics

Automated metrics that agree with human judgements on generated output for an embodied conversational agent
INLG '08 Proceedings of the Fifth International Natural Language Generation Conference

English to Malayalam translation: a statistical approach
Proceedings of the 1st Amrita ACM-W Celebration on Women in Computing in India

Further meta-evaluation of broad-coverage surface realization
EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

User preferences can drive facial expressions: evaluating an embodied conversational agent in a recommender dialogue system
User Modeling and User-Adapted Interaction

Towards an extrinsic evaluation of referring expressions in situated dialogs
INLG '10 Proceedings of the 6th International Natural Language Generation Conference

Human evaluation of a German surface realisation ranker
Empirical methods in natural language generation

The first challenge on generating instructions in virtual environments
Empirical methods in natural language generation

Abstract

Recent years have seen increasing interest in automatic metrics for the evaluation of generation systems. When a system can generate syntactic variation, automatic evaluation becomes more difficult. In this paper, we compare the performance of several automatic evaluation metrics using a corpus of automatically generated paraphrases. We show that these evaluation metrics can at least partially measure adequacy (similarity in meaning), but are not good measures of fluency (syntactic correctness). We make several proposals for improving the evaluation of generation systems that produce variation.