Controlled and reproducible laboratory experiments, enabled by reusable test collections, are a well-established methodology in modern information retrieval research. Before researchers can confidently draw conclusions about the performance of different retrieval methods from test collections, the reliability and trustworthiness of those collections must be established. Although such studies have been performed for ad hoc test collections, currently available resources for evaluating question answering systems have not been similarly analyzed. This study evaluates the quality of the answer patterns and lists of relevant documents currently employed in automatic question answering evaluation, and concludes that they are not suitable for post-hoc experimentation. These resources, created from runs submitted by TREC QA track participants, do not yield fair and reliable assessments of systems that did not participate in the original evaluations. Potential solutions for closing this evaluation gap, and their shortcomings, are discussed.
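To make concrete what "answer patterns" and "lists of relevant documents" refer to, the following is a minimal sketch of how pattern-based automatic QA scoring typically works: a system response is counted correct if the answer string matches one of the answer patterns (regular expressions), and, under strict scoring, only if the cited supporting document appears in the judged list. The data structures and values below are illustrative assumptions, not drawn from any actual TREC resource.

```python
import re

# Hypothetical answer patterns and judged-document lists for one question.
# Illustrative only; not actual TREC QA track data.
answer_patterns = {
    "Q1": [r"\bMount Everest\b", r"\bEverest\b"],
}
judged_relevant_docs = {
    "Q1": {"APW19990101.0001", "NYT19990315.0042"},
}

def score_answer(qid, answer_string, supporting_doc):
    """Return (lenient, strict) correctness flags for one system response."""
    # Lenient scoring: the answer string matches some answer pattern.
    lenient = any(
        re.search(pattern, answer_string, flags=re.IGNORECASE)
        for pattern in answer_patterns.get(qid, [])
    )
    # Strict scoring additionally requires the supporting document to be
    # in the list of documents judged relevant for this question.
    strict = lenient and supporting_doc in judged_relevant_docs.get(qid, set())
    return lenient, strict

print(score_answer("Q1", "The highest peak is Mount Everest.", "APW19990101.0001"))
# -> (True, True)
```

The limitation the abstract points to follows directly from this setup: both the patterns and the judged-document lists are derived from the pooled submissions of the original participants, so a correct answer phrased differently, or supported by an unjudged document, can be scored as wrong for a system evaluated after the fact.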