Estimating average precision when judgments are incomplete

Authors:
Emine Yilmaz; Javed A. Aslam
Affiliations:
Northeastern University, College of Computer and Information Science, 02115, Boston, MA, USA; Northeastern University, College of Computer and Information Science, 02115, Boston, MA, USA
Venue:
Knowledge and Information Systems
Year:
2008

Abstract

We consider the problem of evaluating retrieval systems with incomplete relevance judgments. Recently, Buckley and Voorhees showed that standard measures of retrieval performance are not robust to incomplete judgments, and they proposed a new measure, bpref, that is much more robust to them. Although bpref is highly correlated with average precision when the judgments are effectively complete, its value deviates both from average precision and from its own complete-judgment value as the judgment set degrades, especially at very low levels of assessment. In this work, we propose three new evaluation measures: induced AP, subcollection AP, and inferred AP. All three are equivalent to average precision when the relevance judgments are complete, and all three are statistical estimates of average precision when the relevance judgments are a random subset of the complete judgments. We consider natural scenarios that yield highly incomplete judgments, such as random judgment sets or very shallow depth pools, and we compare the robustness of the three proposed measures with that of bpref in both scenarios. Using TREC data, we demonstrate that these measures are more robust to incomplete relevance judgments than bpref, both in terms of how well they estimate average precision (as measured with complete relevance judgments) and how well they estimate their own values (as measured with complete relevance judgments). Finally, since inferred AP is the most accurate approximation to average precision and the most robust measure in the presence of incomplete judgments, we provide a detailed analysis of this measure, covering both its behavior in theory and its implementation in practice.
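
To make the quantities in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not define the measures, so the definitions here are assumptions: `induced_ap` is taken to be average precision on the ranking with unjudged documents removed, and `inferred_ap` follows a commonly cited form of the expected-precision estimator, simplified by assuming every ranked document lies in the pool from which judgments were sampled. The function names, the `eps` smoothing constant, and the toy data are all hypothetical.

```python
from typing import Dict, List


def average_precision(ranking: List[str], judgments: Dict[str, bool]) -> float:
    """Standard AP; documents missing from `judgments` count as nonrelevant."""
    total_relevant = sum(1 for rel in judgments.values() if rel)
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if judgments.get(doc, False):
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant


def induced_ap(ranking: List[str], judgments: Dict[str, bool]) -> float:
    """Assumed definition: AP on the ranking restricted to judged documents."""
    condensed = [doc for doc in ranking if doc in judgments]
    return average_precision(condensed, judgments)


def inferred_ap(ranking: List[str], judgments: Dict[str, bool],
                eps: float = 1e-5) -> float:
    """Assumed estimator: at each judged relevant document at rank k, expected
    precision is 1/k + ((k-1)/k) * (rel + eps) / (rel + nonrel + 2*eps),
    where rel and nonrel count judged documents above rank k.
    Simplification: all ranked documents are assumed to be in the pool."""
    total_relevant = sum(1 for rel in judgments.values() if rel)
    if total_relevant == 0:
        return 0.0
    rel_above = 0
    nonrel_above = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        label = judgments.get(doc)  # None means unjudged
        if label is True:
            if rank == 1:
                total += 1.0
            else:
                est = (rel_above + eps) / (rel_above + nonrel_above + 2 * eps)
                total += 1.0 / rank + ((rank - 1) / rank) * est
            rel_above += 1
        elif label is False:
            nonrel_above += 1
    return total / total_relevant


# Toy data (hypothetical): one ranking, complete judgments, and a judged subset.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
complete = {"d1": True, "d2": False, "d3": True,
            "d4": False, "d5": False, "d6": True}
sample = {"d1": True, "d4": False, "d6": True}

if __name__ == "__main__":
    for name, fn in [("AP", average_precision),
                     ("induced AP", induced_ap),
                     ("inferred AP", inferred_ap)]:
        print(name, round(fn(ranking, complete), 4), round(fn(ranking, sample), 4))
```

Under these assumptions, all three functions agree on the complete judgments (up to the small `eps` smoothing term), which mirrors the abstract's statement that the proposed measures reduce to average precision when judgments are complete. On the judged subset they diverge, because each one handles the unjudged documents differently.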