Using statistical testing in the evaluation of retrieval experiments
SIGIR '93 Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval
Information storage and retrieval
Evaluating evaluation measure stability
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Evaluation by highly relevant documents
SIGIR '01 Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
The effect of topic set size on retrieval experiment error
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Cumulated gain-based evaluation of IR techniques
ACM Transactions on Information Systems (TOIS)
SIGIR '03 Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
Measuring retrieval effectiveness: a new proposal and a first experimental validation
Journal of the American Society for Information Science and Technology
Retrieval evaluation with incomplete information
SIGIR '04 Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Binary and graded relevance in IR evaluations: comparison of the effects on ranking of IR systems
Information Processing and Management: an International Journal
Information retrieval system evaluation: effort, sensitivity, and reliability
SIGIR '05 Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Evaluating evaluation metrics based on the bootstrap
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Give me just one highly relevant document: P-measure
SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
The reliability of metrics based on graded relevance
AIRS'05 Proceedings of the Second Asia conference on Asia Information Retrieval Technology
On effectiveness measures and relevance functions in ranking INEX systems
AIRS'05 Proceedings of the Second Asia conference on Asia Information Retrieval Technology
On the reliability of factoid question answering evaluation
ACM Transactions on Asian Language Information Processing (TALIP)
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Test theory for evaluating reliability of IR test collections
Information Processing and Management: an International Journal
Comparing metrics across TREC and NTCIR: the robustness to system bias
CIKM '08 Proceedings of the 17th ACM conference on Information and knowledge management
Building a framework for the probability ranking principle by a family of expected weighted rank
ACM Transactions on Information Systems (TOIS)
A few good topics: Experiments in topic set reduction for retrieval evaluation
ACM Transactions on Information Systems (TOIS)
Extracting learning concepts from educational texts in intelligent tutoring systems automatically
Expert Systems with Applications: An International Journal
A simple measure to assess non-response
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
Pattern Recognition Letters
Bootstrap-based comparisons of IR metrics for finding one relevant document
AIRS'06 Proceedings of the Third Asia conference on Information Retrieval Technology
Measures for benchmarking semantic web service matchmaking correctness
ESWC'10 Proceedings of the 7th international conference on The Semantic Web: research and Applications - Volume Part II
Journal of the American Society for Information Science and Technology
Evaluating question answering validation as a classification problem
Language Resources and Evaluation
On the measurement of test collection reliability
SIGIR '13 Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval
Evaluation in Music Information Retrieval
Journal of Intelligent Information Systems
This paper compares 14 information retrieval metrics based on graded relevance, together with 10 traditional metrics based on binary relevance, in terms of stability, sensitivity, and resemblance of system rankings. More specifically, we compare these metrics using the Buckley/Voorhees stability method, the Voorhees/Buckley swap method, and Kendall's rank correlation, with three data sets comprising test collections and submitted runs from NTCIR. Our experiments show that Normalised Discounted Cumulative Gain at document cut-off l and its averaged variant are the best among the rank-based graded-relevance metrics, provided that l is large. If, on the other hand, one requires a recall-based graded-relevance metric that is highly correlated with Average Precision, then Q-measure is the best choice. Moreover, these best graded-relevance metrics are at least as stable and sensitive as Average Precision, and they are fairly robust to the choice of gain values.
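As a concrete illustration of the two graded-relevance metrics the abstract singles out, below is a minimal sketch, not the authors' code, of nDCG at document cut-off l (using the common log2(r+1) discount; the original Järvelin/Kekäläinen formulation discounts differently) and of Sakai's Q-measure. The gain values, function names, and the toy ranked list are illustrative assumptions.

import math

def ndcg_at_l(gains_by_rank, ideal_gains, l):
    # nDCG@l: DCG of the run's top l ranks divided by the DCG of the
    # ideal (gain-sorted) ranking. log2(r+1) is one common discount.
    def dcg(gains):
        return sum(g / math.log2(r + 1) for r, g in enumerate(gains[:l], start=1))
    return dcg(gains_by_rank) / dcg(ideal_gains)

def q_measure(gains_by_rank, ideal_gains, R, beta=1.0):
    # Q-measure: the blended ratio (C(r) + beta*cg(r)) / (r + beta*cg*(r)),
    # averaged over the ranks of relevant retrieved documents and divided
    # by R, the total number of relevant documents for the topic.
    cg = cg_ideal = 0.0   # cumulative gain of the run / of the ideal ranking
    count_rel = 0         # C(r): relevant documents seen in the top r
    total = 0.0
    for r, g in enumerate(gains_by_rank, start=1):
        cg += g
        cg_ideal += ideal_gains[r - 1] if r <= len(ideal_gains) else 0.0
        if g > 0:  # a relevant document is retrieved at rank r
            count_rel += 1
            total += (count_rel + beta * cg) / (r + beta * cg_ideal)
    return total / R

# Toy topic: per-rank gains of a run (0 = non-relevant) and the ideal list.
run = [3, 0, 1, 2, 0]
ideal = [3, 2, 1, 0, 0]
print(ndcg_at_l(run, ideal, l=5), q_measure(run, ideal, R=3))

Note that with binary gains and beta = 0 the blended ratio reduces to precision at r, so Q-measure collapses to Average Precision; this is why the abstract can treat Q-measure as a recall-based, AP-like metric that additionally rewards graded relevance.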