A simple measure to assess non-response
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Formulating Question Answering Validation as a classification problem facilitates the introduction of Machine Learning techniques to improve the overall performance of Question Answering systems. The unequal proportion of positive and negative examples in the evaluation collections has motivated the use of measures based on precision and recall. However, an evaluation based on the analysis of Receiver Operating Characteristic (ROC) space is sometimes preferred for classification over unbalanced collections. In this article we compare both evaluation approaches according to their rationale, their stability, their discrimination power, and their suitability for the particularities of the Answer Validation task.
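The contrast between the two evaluation approaches can be illustrated with a minimal sketch (the counts below are hypothetical, not taken from the article): when the true-positive and false-positive rates are held fixed, a classifier's point in ROC space is unchanged as the collection becomes more skewed toward negatives, while its precision degrades.

```python
def metrics(tp, fp, fn, tn):
    """Precision/recall (collection-dependent) vs ROC coordinates (rate-based)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)   # recall equals the true-positive rate (TPR)
    fpr = fp / (fp + tn)      # false-positive rate, the x-axis of ROC space
    return precision, recall, fpr

# Balanced collection: 100 positives, 100 negatives, TPR = 0.8, FPR = 0.1
p1, r1, f1 = metrics(tp=80, fp=10, fn=20, tn=90)

# Skewed collection: same rates, but 100 positives against 1000 negatives
p2, r2, f2 = metrics(tp=80, fp=100, fn=20, tn=900)

print(p1, r1, f1)  # precision ~ 0.889 on the balanced collection
print(p2, r2, f2)  # precision ~ 0.444, although (TPR, FPR) is identical
```

The ROC point (FPR, TPR) is (0.1, 0.8) in both collections, so a ROC-space comparison is insensitive to the proportion of positive and negative examples, whereas precision halves on the skewed collection.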