A methodology based on "information nuggets" has recently emerged as the de facto standard for evaluating answers to complex questions. After several implementations in the TREC question answering tracks, the community has gained a better understanding of its many characteristics. This paper focuses on one particular aspect of the evaluation: the human assignment of nuggets to answer strings, which serves as the basis of the F-score computation. As a byproduct of the TREC 2006 ciQA task, identical answer strings were independently evaluated twice, which allowed us to assess the consistency of human judgments. Based on these results, we explored simulations of assessor behavior that quantify the resulting variation in scores. Understanding this variation in turn lets researchers draw more confident conclusions when comparing systems.
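To make the evaluation concrete, the sketch below implements the standard TREC-style nugget F-score (recall over vital nuggets, a length-based precision with a per-nugget character allowance, and a beta-weighted harmonic mean) together with a simple Monte Carlo simulation of assessor disagreement in which each nugget-to-answer assignment is independently flipped with some probability. The F-score formula follows the published TREC definition; the flip-probability model, the parameter values, and all function names are illustrative assumptions, not the paper's exact simulation.

```python
import random

def nugget_f_score(num_vital_returned, num_okay_returned, total_vital,
                   answer_length, beta=3.0, char_allowance=100):
    """TREC-style nugget F-score (sketch; parameter names are illustrative).

    Recall is the fraction of vital nuggets matched; precision applies a
    length penalty beyond an allowance of `char_allowance` characters per
    matched (vital or okay) nugget.
    """
    recall = num_vital_returned / total_vital if total_vital else 0.0
    allowance = char_allowance * (num_vital_returned + num_okay_returned)
    if answer_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (answer_length - allowance) / answer_length
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

def simulate_assessor_variation(nuggets, answer_length, flip_prob=0.1,
                                trials=1000, beta=3.0, seed=0):
    """Monte Carlo sketch of assessor disagreement (assumed model): each
    nugget assignment is independently flipped with probability `flip_prob`.
    `nuggets` is a list of (is_vital, was_matched) booleans for one answer.
    Returns the mean and standard deviation of the simulated F-scores.
    """
    rng = random.Random(seed)
    total_vital = sum(1 for vital, _ in nuggets if vital)
    scores = []
    for _ in range(trials):
        vital_hits = okay_hits = 0
        for vital, matched in nuggets:
            if rng.random() < flip_prob:
                matched = not matched  # simulate a differing judgment
            if matched:
                if vital:
                    vital_hits += 1
                else:
                    okay_hits += 1
        scores.append(nugget_f_score(vital_hits, okay_hits, total_vital,
                                     answer_length, beta=beta))
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    return mean, std

# Hypothetical example: 4 vital and 3 okay nuggets, some matched,
# for a 700-character answer string.
nuggets = [(True, True), (True, True), (True, False), (True, True),
           (False, True), (False, False), (False, True)]
print(simulate_assessor_variation(nuggets, answer_length=700))
```

Running the example reports how much the F-score drifts under simulated judgment flips; comparing that spread to the observed score difference between two systems gives a rough sense of whether the difference exceeds assessor noise.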