The original motivation for using question series in the TREC 2004 question answering track was to model aspects of dialogue processing in an evaluation task that included different question types. The structure introduced by the series proved to have an important additional benefit: a series is at an appropriate level of granularity for aggregating scores into an effective evaluation. A series is small enough to be meaningful at the task level, since it represents a single user interaction, yet large enough to avoid the highly skewed score distributions exhibited by single questions. An analysis of the reliability of the per-series evaluation shows that the evaluation is stable for the differences in scores observed in the track.
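As a rough illustration of the aggregation idea only (not the track's official scoring, which weights factoid, list, and "other" questions with their own metrics), the sketch below treats a series score as the mean of hypothetical per-question scores and a run score as the mean over series; all names and numbers are invented for the example.

```python
from statistics import mean

# Hypothetical per-question scores for one run, grouped by series.
# Individual questions are often scored 0 or 1, giving the skewed
# per-question distribution the abstract mentions.
run_scores = {
    "series_1": [0.0, 1.0, 0.0, 0.5],
    "series_2": [1.0, 1.0, 0.3, 0.0],
    "series_3": [0.0, 0.0, 0.0, 0.2],
}

# Per-series aggregation: each series (one simulated user interaction)
# is summarized by the mean of its questions' scores.
series_scores = {sid: mean(scores) for sid, scores in run_scores.items()}

# Run-level score: mean over series, so every interaction counts equally
# and reliability can be analyzed over the per-series score distribution.
run_score = mean(series_scores.values())

print(series_scores)
print(f"run score: {run_score:.3f}")
```

Averaging at the series level rather than the question level keeps the unit of evaluation aligned with a single user interaction while smoothing the near-binary per-question scores, which is what makes the per-series distribution stable enough for reliability analysis.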