Evaluating evaluation measure stability. SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval.
The effect of topic set size on retrieval experiment error. SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval.
Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS).
SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval.
Measuring retrieval effectiveness: a new proposal and a first experimental validation. Journal of the American Society for Information Science and Technology.
Binary and graded relevance in IR evaluations: comparison of the effects on ranking of IR systems. Information Processing and Management: an International Journal.
Ranking the NTCIR systems based on multigrade relevance. AIRS '04: Proceedings of the 2004 international conference on Asian Information Retrieval Technology.
eXtended cumulated gain measures for the evaluation of content-oriented XML retrieval. ACM Transactions on Information Systems (TOIS).
On the reliability of information retrieval metrics based on graded relevance. Information Processing and Management: an International Journal, special issue AIRS2005: Information retrieval research in Asia.
Bootstrap-based comparisons of IR metrics for finding one relevant document. AIRS '06: Proceedings of the Third Asia conference on Information Retrieval Technology.
This paper compares 14 metrics designed for information retrieval evaluation with graded relevance, together with 10 traditional metrics based on binary relevance, in terms of reliability and resemblance of system rankings. More specifically, we use two test collections with submitted runs from the Chinese IR and English IR tasks in the NTCIR-3 CLIR track to examine the metrics, using the methods proposed by Buckley/Voorhees and Voorhees/Buckley as well as Kendall's rank correlation. Our results show that AnDCG_l and nDCG_l ((Average) Normalised Discounted Cumulative Gain at document cut-off l) are good metrics, provided that l is large. However, if one wants to avoid the parameter l altogether, or if one requires a metric that closely resembles TREC Average Precision, then Q-measure appears to be the best choice.
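For readers unfamiliar with the two metrics the abstract singles out, the following Python sketch shows how nDCG at a document cut-off l and Q-measure can be computed from graded relevance judgments. It is a minimal illustration under stated assumptions, not the paper's implementation: the gain values (3/2/1 for highly/fairly/partially relevant), the log2(r+1) discount (one common nDCG variant; the original Järvelin/Kekäläinen formulation discounts slightly differently), and the toy run below are all assumptions.

    from math import log2

    def ndcg_at_l(ranked_gains, ideal_gains, l):
        """nDCG@l with the common log2(rank+1) discount.

        ranked_gains: gain of the document at each rank of the system run.
        ideal_gains:  gains of all judged documents, used to build the
                      ideal (descending-gain) ranking.
        """
        def dcg(gains):
            return sum(g / log2(r + 1) for r, g in enumerate(gains[:l], start=1))
        ideal = dcg(sorted(ideal_gains, reverse=True))
        return dcg(ranked_gains) / ideal if ideal > 0 else 0.0

    def q_measure(ranked_gains, ideal_gains, beta=1.0):
        """Q-measure: mean blended ratio over the R relevant documents.

        BR(r) = (beta*cg(r) + count(r)) / (beta*cg*(r) + r), summed at each
        rank r holding a relevant document; missed relevant docs add 0.
        """
        ideal = sorted(ideal_gains, reverse=True)
        R = sum(1 for g in ideal if g > 0)
        cg = icg = count = total = 0.0
        for r, g in enumerate(ranked_gains, start=1):
            cg += g
            icg += ideal[r - 1] if r <= len(ideal) else 0.0
            if g > 0:  # relevant document retrieved at rank r
                count += 1
                total += (beta * cg + count) / (beta * icg + r)
        return total / R if R else 0.0

    # Toy example: a 5-document run against 4 judged relevant documents.
    run  = [3, 0, 2, 0, 1]   # gains at ranks 1..5 of a system run
    pool = [3, 2, 1, 1]      # gains of all judged relevant documents
    print(ndcg_at_l(run, pool, l=5), q_measure(run, pool))

The rank-correlation side of the study can be approximated by scoring every system with two metrics and feeding the two score lists to scipy.stats.kendalltau: a tau close to 1 means the metrics rank the systems nearly identically, which is the sense in which the abstract says Q-measure "closely resembles" TREC Average Precision.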