The relevance of human judgment in an evaluation campaign is illustrated here through the DEFT text mining campaigns. First, testing a candidate topic on a small number of human evaluators tells us whether the task is feasible; this information comes both from the judges' results and from their personal impressions after taking the test. Second, the individual judges' results, together with their pairwise agreement, are used to adjust the task (the choice of a marking scale for DEFT'07 and the selection of topical categories for DEFT'08). Finally, comparing the competitors' results at the end of the evaluation campaign confirms the choices made at its outset and provides a basis for redefining the task in any future campaign on the same topic.
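Pairwise agreement between judges is commonly quantified with a chance-corrected coefficient such as Cohen's kappa; the sketch below illustrates one way this could be computed. The abstract does not specify which agreement measure DEFT used, and the judge labels and 3-point marking scale here are invented for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical: two judges rating the same 8 documents on a 3-point scale.
judge_1 = [1, 2, 2, 3, 1, 2, 3, 3]
judge_2 = [1, 2, 3, 3, 1, 2, 2, 3]
print(f"kappa = {cohen_kappa(judge_1, judge_2):.3f}")  # kappa = 0.619
```

A low kappa on a pilot test of this kind would suggest the task or scale needs adjustment before launching the full campaign, which is the role pairwise judge comparison plays in the methodology described above.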