On the robustness of relevance measures with incomplete judgments

Authors:
Tanuja Bompada;Chi-Chao Chang;John Chen;Ravi Kumar;Rajesh Shenoy
Affiliations:
Yahoo!, Sunnyvale, CA;Yahoo!, Sunnyvale, CA;Yahoo!, Sunnyvale, CA;Yahoo!, Sunnyvale, CA;Yahoo!, Sunnyvale, CA
Venue:
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2007

Citing 16
Cited 7

Statistical inference in retrieval effectiveness evaluation

Information Processing and Management: an International Journal
How reliable are the results of large-scale information retrieval experiments?

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Evaluating evaluation measure stability

SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Evaluation by highly relevant documents

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Information Retrieval

Information Retrieval
Modern Information Retrieval

Modern Information Retrieval
Cumulated gain-based evaluation of IR techniques

ACM Transactions on Information Systems (TOIS)
The Philosophy of Information Retrieval Evaluation

CLEF '01 Revised Papers from the Second Workshop of the Cross-Language Evaluation Forum on Evaluation of Cross-Language Information Retrieval Systems
Scaling IR-system evaluation using term relevance sets

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Retrieval evaluation with incomplete information

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Combining eye movements and collaborative filtering for proactive information retrieval

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Accurately interpreting clickthrough data as implicit feedback

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Information retrieval system evaluation: effort, sensitivity, and reliability

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
A statistical method for system evaluation using incomplete judgments

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Estimating average precision with incomplete and imperfect judgments

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Exploring cost-effective approaches to human evaluation of search engine relevance

ECIR'05 Proceedings of the 27th European conference on Advances in Information Retrieval Research

On information retrieval metrics designed for evaluation with incomplete relevance assessments

Information Retrieval
A simple and efficient sampling method for estimating AP and NDCG

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Comparing metrics across TREC and NTCIR: the robustness to system bias

Proceedings of the 17th ACM conference on Information and knowledge management
Measuring the Search Effectiveness of a Breadth-First Crawl

ECIR '09 Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval
Concept-Based Video Retrieval

Foundations and Trends in Information Retrieval
Examining the robustness of evaluation metrics for patent retrieval with incomplete relevance judgements

CLEF'10 Proceedings of the 2010 international conference on Multilingual and multimodal information access evaluation: cross-language evaluation forum
Evaluation effort, reliability and reusability in XML retrieval

Journal of the American Society for Information Science and Technology

Quantified Score

Hi-index	0.00

Visualization

Abstract

We investigate the robustness of three widely used IR relevance measures for large data collections with incomplete judgments. The relevance measures we consider are the bpref measure introduced by Buckley and Voorhees [7], the inferred average precision (infAP) introduced by Aslam and Yilmaz [4], and the normalized discounted cumulative gain (NDCG) measure introduced by Järvelin and Kekäläinen [8]. Our main results show that NDCG consistently performs better than both bpref and infAP. The experiments are performed on standard TREC datasets, under different levels of incompleteness of judgments, and using two different evaluation methods, namely, the Kendall correlation measures order between system rankings and pairwise statistical significance testing; the latter may be of independent interest.