Controlled and reproducible laboratory experiments, enabled by reusable test collections, are a well-established methodology in modern information retrieval research. Before researchers can confidently draw conclusions about the performance of different retrieval methods from test collections, the reliability and trustworthiness of those collections must be established. Although such studies have been performed for ad hoc test collections, currently available resources for evaluating question answering systems have not been similarly analyzed. This study evaluates the quality of the answer patterns and lists of relevant documents currently employed in automatic question answering evaluation, and concludes that they are not suitable for post-hoc experimentation. These resources, created from runs submitted by TREC QA track participants, do not yield fair and reliable assessments of systems that did not participate in the original evaluations. Potential solutions for closing this evaluation gap, and their shortcomings, are discussed.
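To make concrete what "answer patterns" and "lists of relevant documents" refer to, the following is a minimal sketch of how pattern-based automatic QA scoring typically works: a system response is counted correct if the answer string matches one of the answer patterns (regular expressions), and, under strict scoring, only if the cited supporting document appears in the judged list. The data structures and values below are illustrative assumptions, not drawn from any actual TREC resource.

```python
import re

# Hypothetical answer patterns and judged-document lists for one question.
# Illustrative only; not actual TREC QA track data.
answer_patterns = {
    "Q1": [r"\bMount Everest\b", r"\bEverest\b"],
}
judged_relevant_docs = {
    "Q1": {"APW19990101.0001", "NYT19990315.0042"},
}

def score_answer(qid, answer_string, supporting_doc):
    """Return (lenient, strict) correctness flags for one system response."""
    # Lenient scoring: the answer string matches some answer pattern.
    lenient = any(
        re.search(pattern, answer_string, flags=re.IGNORECASE)
        for pattern in answer_patterns.get(qid, [])
    )
    # Strict scoring additionally requires the supporting document to be
    # in the list of documents judged relevant for this question.
    strict = lenient and supporting_doc in judged_relevant_docs.get(qid, set())
    return lenient, strict

print(score_answer("Q1", "The highest peak is Mount Everest.", "APW19990101.0001"))
# -> (True, True)
```

The limitation the abstract points to follows directly from this setup: both the patterns and the judged-document lists are derived from the pooled submissions of the original participants, so a correct answer phrased differently, or supported by an unjudged document, can be scored as wrong for a system evaluated after the fact.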