Evaluation of system measures for incomplete relevance judgment in IR

  • Authors:
  • Shengli Wu; Sally McClean

  • Affiliations:
  • School of Computing and Mathematics, University of Ulster, UK (both authors)

  • Venue:
  • FQAS'06 Proceedings of the 7th international conference on Flexible Query Answering Systems
  • Year:
  • 2006

Abstract

Incomplete relevance judgment has become the norm for major information retrieval evaluation events such as TREC, but its effect on system measures is not well understood. In this paper, we evaluate four system measures under incomplete relevance judgment: mean average precision, R-precision, normalized average precision over all documents, and normalized discounted cumulative gain. Among them, normalized average precision over all documents is newly introduced, and both mean average precision and R-precision are generalized for graded relevance judgment. These four measures share a common characteristic: complete relevance judgment is required to calculate their accurate values. We empirically investigate these measures through extensive experiments on TREC data, aiming to determine how incomplete relevance judgment affects them. From these experiments, we conclude that incomplete relevance judgment significantly affects the values of all four measures: with the pooling method used in TREC, the more incomplete the relevance judgment, the higher the values of all these measures usually become. We also conclude that mean average precision is the most sensitive but least reliable measure, normalized discounted cumulative gain and normalized average precision over all documents are the most reliable but least sensitive, while R-precision lies in between.
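For readers unfamiliar with the measures named in the abstract, the following sketch shows textbook formulations of average precision, R-precision, and normalized discounted cumulative gain for a single ranked result list. The function names are illustrative, not from the paper; the nDCG variant here uses the common log2 discount and does not reflect the paper's generalizations to graded relevance for MAP and R-precision.

```python
import math

def average_precision(ranking, num_relevant):
    """Average precision for a binary-relevance ranking (1 = relevant, 0 = not).

    num_relevant is the total number of relevant documents for the query,
    which under incomplete judgment is only an estimate from the pool.
    """
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at each relevant rank
    return precision_sum / num_relevant if num_relevant else 0.0

def r_precision(ranking, num_relevant):
    """Precision at rank R, where R = number of relevant documents."""
    if num_relevant == 0:
        return 0.0
    return sum(ranking[:num_relevant]) / num_relevant

def ndcg(grades, k=None):
    """Normalized DCG for graded relevance (higher grade = more relevant)."""
    k = k or len(grades)
    def dcg(gs):
        # log2(rank + 1) discount, so rank 1 gets weight 1
        return sum(g / math.log2(r + 1) for r, g in enumerate(gs[:k], start=1))
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal else 0.0
```

Note how every function divides by a quantity derived from the full set of relevance judgments (num_relevant, or the ideal DCG): this is the shared characteristic the abstract points to, and the reason unjudged documents bias all of these scores.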