Evaluating competing technologies on a common problem set is a powerful way to improve the state of the art and hasten technology transfer. Yet poorly designed evaluations can waste research effort or even mislead researchers with faulty conclusions. Thus it is important to examine the quality of a new evaluation task to establish its reliability. This paper provides an example of one such assessment by analyzing the task within the TREC 2002 question answering track. The analysis demonstrates that comparative results from the new task are stable, and empirically estimates the size of the difference required between scores to confidently conclude that two runs are different.
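To make the abstract's last point concrete, the sketch below illustrates one common way such a threshold can be estimated: a swap-style topic-set analysis in the spirit of Voorhees and Buckley. This is a minimal illustration, not the paper's exact procedure; the function names, bin width, and the `per_topic_scores` input layout are assumptions introduced here for exposition.

```python
"""Illustrative sketch (not the paper's exact method) of a swap-style
stability analysis: repeatedly split the topic set into two halves, compare
every pair of runs on both halves, and record how often the halves disagree
about which run is better, binned by the observed score difference."""
import itertools
import random
from collections import defaultdict


def half_mean(scores, topic_ids):
    """Mean per-topic score over the given subset of topic indices."""
    return sum(scores[t] for t in topic_ids) / len(topic_ids)


def swap_rates(per_topic_scores, n_trials=1000, bin_width=0.02, seed=0):
    """per_topic_scores: dict mapping run name -> list of per-topic scores,
    with all runs scored on the same topics in the same order.
    Returns a dict mapping a score-difference bin to the fraction of trials
    in which the two topic-set halves disagreed about which run was better."""
    rng = random.Random(seed)
    runs = list(per_topic_scores)
    n_topics = len(next(iter(per_topic_scores.values())))
    counts = defaultdict(lambda: [0, 0])  # bin -> [disagreements, comparisons]

    for _ in range(n_trials):
        topics = list(range(n_topics))
        rng.shuffle(topics)
        half_a, half_b = topics[: n_topics // 2], topics[n_topics // 2:]

        for r1, r2 in itertools.combinations(runs, 2):
            s1, s2 = per_topic_scores[r1], per_topic_scores[r2]
            diff_a = half_mean(s1, half_a) - half_mean(s2, half_a)
            diff_b = half_mean(s1, half_b) - half_mean(s2, half_b)

            # Bin the comparison by the size of the difference on half A.
            bin_key = round(abs(diff_a) // bin_width * bin_width, 4)
            counts[bin_key][1] += 1
            if diff_a * diff_b < 0:  # the halves disagree on the better run
                counts[bin_key][0] += 1

    return {b: d / n for b, (d, n) in sorted(counts.items()) if n}
```

Under this kind of analysis, the smallest score-difference bin whose swap (disagreement) rate falls below a chosen error tolerance, say 5%, serves as the empirical estimate of how far apart two runs' scores must be before one can confidently conclude that the runs are genuinely different.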