Rating-scale evaluations are common in NLP, but they are problematic for a range of reasons: for example, they can be unintuitive for evaluators, inter-evaluator agreement and self-consistency tend to be low, and the parametric statistics commonly applied to the results are not generally considered appropriate for ordinal data. In this paper, we compare rating scales with an alternative evaluation paradigm, preference-strength judgement experiments (PJEs), in which evaluators have the simpler task of deciding which of two texts is better in terms of a given quality criterion. We present three pairs of evaluation experiments assessing text fluency and clarity for different data sets, where one experiment in each pair is a rating-scale experiment and the other is a PJE. We find that the PJE versions of the experiments have better evaluator self-consistency and inter-evaluator agreement, and a larger proportion of variation accounted for by system differences, resulting in a larger number of significant differences being found.
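The sketch below is a minimal illustration, not the paper's actual analysis, of the two paradigms the abstract contrasts: ordinal rating-scale scores collected per output versus signed preference-strength judgements collected per pair of outputs, together with one way of computing the "proportion of variation accounted for by system differences" (eta-squared from a one-way breakdown). All data, system names, and function names are invented for illustration.

```python
import numpy as np

def eta_squared(scores_by_system):
    """Proportion of total score variance explained by which system produced the text
    (between-system sum of squares divided by total sum of squares)."""
    all_scores = np.concatenate(list(scores_by_system.values()))
    grand_mean = all_scores.mean()
    ss_total = ((all_scores - grand_mean) ** 2).sum()
    ss_between = sum(
        len(s) * (s.mean() - grand_mean) ** 2
        for s in scores_by_system.values()
    )
    return ss_between / ss_total

# Rating-scale paradigm: each evaluator assigns an ordinal score (e.g. 1-7)
# to each output viewed in isolation.  Hypothetical scores for two systems:
ratings = {
    "system_A": np.array([5, 6, 4, 5, 6, 5], dtype=float),
    "system_B": np.array([4, 5, 4, 3, 5, 4], dtype=float),
}

# PJE paradigm: each evaluator sees a pair of texts (one from each system) and
# records which is better and by how much; here encoded as a signed preference
# strength in [-1, 1], positive meaning system_A's text was preferred.
pje_preferences = np.array([0.6, 0.4, 0.7, 0.2, 0.5, 0.3])

print("eta^2 for rating-scale scores:", round(eta_squared(ratings), 3))
print("mean preference strength for A over B:", round(pje_preferences.mean(), 3))
```

In this toy setup, a larger eta-squared (or a preference distribution shifted consistently away from zero) corresponds to more of the observed variation being attributable to genuine system differences rather than evaluator noise, which is the quantity the abstract reports as favouring the PJE design.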