Estimating average precision when judgments are incomplete

  • Authors:
  • Emine Yilmaz;Javed A. Aslam

  • Affiliations:
  • Northeastern University, College of Computer and Information Science, 02115, Boston, MA, USA;Northeastern University, College of Computer and Information Science, 02115, Boston, MA, USA

  • Venue:
  • Knowledge and Information Systems
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

We consider the problem of evaluating retrieval systems with incomplete relevance judgments. Recently, Buckley and Voorhees showed that standard measures of retrieval performance are not robust to incomplete judgments, and they proposed a new measure, bpref, that is much more robust to incomplete judgments. Although bpref is highly correlated with average precision when the judgments are effectively complete, the value of bpref deviates from average precision and from its own value as the judgment set degrades, especially at very low levels of assessment. In this work, we propose three new evaluation measures induced AP, subcollection AP, and inferred AP that are equivalent to average precision when the relevance judgments are complete and that are statistical estimates of average precision when relevance judgments are a random subset of complete judgments. We consider natural scenarios which yield highly incomplete judgments such as random judgment sets or very shallow depth pools. We compare and contrast the robustness of the three measures proposed in this work with bpref for both of these scenarios. Through the use of TREC data, we demonstrate that these measures are more robust to incomplete relevance judgments than bpref, both in terms of how well the measures estimate average precision (as measured with complete relevance judgments) and how well they estimate themselves (as measured with complete relevance judgments). Finally, since inferred AP is the most accurate approximation to average precision and the most robust measure in the presence of incomplete judgments, we provide a detailed analysis of this measure, both in terms of its behavior in theory and its implementation in practice.