Evaluation of system measures for incomplete relevance judgment in IR

  • Authors:
  • Shengli Wu; Sally McClean

  • Affiliations:
  • School of Computing and Mathematics, University of Ulster, UK (both authors)

  • Venue:
  • FQAS'06 Proceedings of the 7th international conference on Flexible Query Answering Systems
  • Year:
  • 2006

Abstract

Incomplete relevance judgment has become the norm for major information retrieval evaluation events such as TREC, but its effect on system measures is not well understood. In this paper, we evaluate four system measures under incomplete relevance judgment: mean average precision, R-precision, normalized average precision over all documents, and normalized discounted cumulative gain. Among them, normalized average precision over all documents is newly introduced, and both mean average precision and R-precision are generalized for graded relevance judgment. These four measures share a common characteristic: complete relevance judgment is required to calculate their accurate values. We empirically investigate these measures through extensive experiments on TREC data, aiming to determine how incomplete relevance judgment affects them. From these experiments, we conclude that incomplete relevance judgment significantly affects the values of all four measures: with the pooling method used in TREC, the more incomplete the relevance judgment, the higher the values of all these measures usually become. We also conclude that mean average precision is the most sensitive but least reliable measure, normalized discounted cumulative gain and normalized average precision over all documents are the most reliable but least sensitive, while R-precision lies in between.
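For readers unfamiliar with the measures named in the abstract, the following sketch shows textbook formulations of average precision, R-precision, and normalized discounted cumulative gain for a single ranked result list. The function names are illustrative, not from the paper; the nDCG variant here uses the common log2 discount and does not reflect the paper's generalizations to graded relevance for MAP and R-precision.

```python
import math

def average_precision(ranking, num_relevant):
    """Average precision for a binary-relevance ranking (1 = relevant, 0 = not).

    num_relevant is the total number of relevant documents for the query,
    which under incomplete judgment is only an estimate from the pool.
    """
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at each relevant rank
    return precision_sum / num_relevant if num_relevant else 0.0

def r_precision(ranking, num_relevant):
    """Precision at rank R, where R = number of relevant documents."""
    if num_relevant == 0:
        return 0.0
    return sum(ranking[:num_relevant]) / num_relevant

def ndcg(grades, k=None):
    """Normalized DCG for graded relevance (higher grade = more relevant)."""
    k = k or len(grades)
    def dcg(gs):
        # log2(rank + 1) discount, so rank 1 gets weight 1
        return sum(g / math.log2(r + 1) for r, g in enumerate(gs[:k], start=1))
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal else 0.0
```

Note how every function divides by a quantity derived from the full set of relevance judgments (num_relevant, or the ideal DCG): this is the shared characteristic the abstract points to, and the reason unjudged documents bias all of these scores.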