Evaluating the evaluation: a case study using the TREC 2002 question answering track

  • Author: Ellen M. Voorhees
  • Affiliation: National Institute of Standards and Technology, Gaithersburg, MD
  • Venue: NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
  • Year: 2003


Abstract

Evaluating competing technologies on a common problem set is a powerful way to improve the state of the art and hasten technology transfer. Yet poorly designed evaluations can waste research effort or even mislead researchers with faulty conclusions. Thus it is important to examine the quality of a new evaluation task to establish its reliability. This paper provides an example of one such assessment by analyzing the task within the TREC 2002 question answering track. The analysis demonstrates that comparative results from the new task are stable, and empirically estimates the size of the difference required between scores to confidently conclude that two runs are different.
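The stability analysis the abstract describes can be illustrated with a swap-rate style experiment: repeatedly split the question set into two disjoint halves and count how often the halves disagree about which of two runs scored higher. The sketch below is a hypothetical illustration of this general idea, not the paper's exact procedure; the function name `swap_rate` and the synthetic per-question scores are assumptions for demonstration.

```python
import random

def swap_rate(run_a, run_b, trials=1000, seed=0):
    """Estimate how often two disjoint halves of the question set
    disagree about which run is better (a rough stability proxy).

    run_a, run_b: per-question scores for two systems (same length).
    Returns the fraction of random splits in which the halves
    rank the runs in opposite orders.
    """
    rng = random.Random(seed)
    n = len(run_a)
    swaps = 0
    for _ in range(trials):
        idx = list(range(n))
        rng.shuffle(idx)
        half1, half2 = idx[: n // 2], idx[n // 2 :]
        # Score difference (run_a minus run_b) on each half.
        d1 = sum(run_a[i] - run_b[i] for i in half1)
        d2 = sum(run_a[i] - run_b[i] for i in half2)
        if d1 * d2 < 0:  # the two halves name different winners
            swaps += 1
    return swaps / trials

# Synthetic example: run_a is clearly better, so halves rarely disagree.
better = [1.0] * 50
worse = [0.0] * 50
print(swap_rate(better, worse))
```

The intuition matches the abstract: the larger the score difference between two runs relative to per-question noise, the lower the swap rate, and the difference at which the swap rate drops below some tolerance is the margin needed to conclude with confidence that the runs truly differ.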