On the reliability of factoid question answering evaluation

  • Authors:
  • Tetsuya Sakai

  • Affiliations:
  • Knowledge Media Laboratory, Toshiba Corporate R&D Center, JAPAN

  • Venue:
  • ACM Transactions on Asian Language Information Processing (TALIP)
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

This paper compares some existing evaluation metrics for factoid question answering (QA) from the viewpoint of stability and sensitivity, using the NTCIR-4 QAC2 Japanese factoid QA tasks and the Buckley/Voorhees stability method and Voorhees/Buckley swap method. Our main findings are: (1) For QA evaluation with ranked lists containing up to five answers, the fraction of questions with a correct answer within top 5 (NQcorrect5) and that with a correct answer at rank 1 (NQcorrect1) are not as stable and sensitive as reciprocal rank. (2) Q-measure, which can handle multiple correct answers and answer correctness levels, is at least as stable and sensitive as reciprocal rank, provided that a mild gain value assignment is used. Emphasizing answer correctness levels tends to hurt stability and sensitivity, while handling multiple correct answers improves them. As our experimental methods are language-independent, we believe that these findings apply to QA in languages other than Japanese as well.