The relevance of human judgment in an evaluation campaign is illustrated here through the DEFT text mining campaigns. First, testing a candidate topic on a small number of human evaluators tells us whether the task is feasible; this information comes both from the judges' results and from their personal impressions after taking the test. Second, the individual judges' results, together with their pairwise agreement, are used to adjust the task (the choice of a marking scale for DEFT'07 and the selection of topical categories for DEFT'08). Finally, comparing the competitors' results at the end of the evaluation campaign confirms the choices made at its outset and provides a basis for redefining the task in any future campaign on the same topic.
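Pairwise agreement between judges is commonly quantified with a chance-corrected coefficient such as Cohen's kappa; the sketch below illustrates one way this could be computed. The abstract does not specify which agreement measure DEFT used, and the judge labels and 3-point marking scale here are invented for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical: two judges rating the same 8 documents on a 3-point scale.
judge_1 = [1, 2, 2, 3, 1, 2, 3, 3]
judge_2 = [1, 2, 3, 3, 1, 2, 2, 3]
print(f"kappa = {cohen_kappa(judge_1, judge_2):.3f}")  # kappa = 0.619
```

A low kappa on a pilot test of this kind would suggest the task or scale needs adjustment before launching the full campaign, which is the role pairwise judge comparison plays in the methodology described above.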