Retrieval system evaluation: automatic evaluation versus incomplete judgments

Authors:
Claudia Hauff;Franciska de Jong
Affiliations:
University of Twente, Enschede, Netherlands;University of Twente, Enschede, Netherlands
Venue:
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Year:
2010

Citing 5
Cited 0

Ranking retrieval systems without relevance judgments

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
On the effectiveness of evaluating retrieval systems in the absence of relevance judgments

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Minimal test collections for retrieval evaluation

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
A statistical method for system evaluation using incomplete judgments

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
A case for automatic system evaluation

ECIR'2010 Proceedings of the 32nd European conference on Advances in Information Retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

In information retrieval (IR), research aiming to reduce the cost of retrieval system evaluations has been conducted along two lines: (i) the evaluation of IR systems with reduced amounts of manual relevance assessments, and (ii) the fully automatic evaluation of IR systems, thus foregoing the need for manual assessments altogether. The proposed methods in both areas are commonly evaluated by comparing their performance estimates for a set of systems to a ground truth (provided for instance by evaluating the set of systems according to mean average precision). In contrast, in this poster we compare an automatic system evaluation approach directly to two evaluations based on incomplete manual relevance assessments. For the particular case of TREC's Million Query track, we show that the automatic evaluation leads to results which are highly correlated to those achieved by approaches relying on incomplete manual judgments.