Recent investigations of search performance have shown that, even when presented with two systems, one superior and one inferior according to a Cranfield-style batch experiment, real users may perform equally well with either system. In this paper, we explore how these two evaluation paradigms may be reconciled. First, we investigate the DCG@1 and P@1 metrics and their relationship with user performance on a common web search task. Our results show that batch-experiment predictions based on P@1 or DCG@1 translate directly to user search effectiveness. However, marginally relevant documents are not strongly differentiable from non-relevant documents. Therefore, when folding multiple relevance levels into a binary scale, marginally relevant documents should be grouped with non-relevant documents, rather than with highly relevant documents as is currently done in standard IR evaluations. We then investigate relevance mismatch, classifying users based on relevance profiles: the likelihood with which they will judge documents of different relevance levels to be useful. When relevance profiles can be estimated well, this classification scheme can offer further insight into the transferability of batch results to real user search tasks.
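To make the binarization choice concrete, the following is a minimal Python sketch, not code from the paper: the three-level grade scale (0 = non-relevant, 1 = marginally relevant, 2 = highly relevant) and the function names are assumptions for illustration.

```python
import math

def binarize(grade, fold_marginal_down=True):
    """Fold a graded judgment onto a binary scale.

    With fold_marginal_down=True, marginally relevant documents (grade 1)
    are grouped with non-relevant documents, as the abstract recommends;
    with False, they are grouped with relevant documents, as in the
    standard TREC-style fold.
    """
    threshold = 2 if fold_marginal_down else 1
    return 1 if grade >= threshold else 0

def precision_at_1(grades, fold_marginal_down=True):
    """P@1: is the top-ranked document relevant under the chosen fold?"""
    return float(binarize(grades[0], fold_marginal_down))

def dcg_at_k(grades, k):
    """DCG@k with linear gains; at k=1 this reduces to the top grade."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))

# A ranking whose top document is only marginally relevant:
ranking = [1, 2, 0]
print(precision_at_1(ranking))                            # 0.0 (marginal folded down)
print(precision_at_1(ranking, fold_marginal_down=False))  # 1.0 (standard fold)
print(dcg_at_k(ranking, 1))                               # 1.0
```

The example shows why the fold matters at rank 1: the same ranking scores 0.0 or 1.0 on P@1 depending solely on where marginally relevant documents are grouped.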
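The relevance-profile idea can likewise be sketched as a per-grade estimate of how likely a user is to find documents useful. This is a hypothetical reconstruction under the same assumed grade scale, not the paper's estimation method; the judgment-log format and the `relevance_profile` helper are invented for illustration.

```python
from collections import Counter

# Hypothetical per-user judgment log: (document_grade, user_found_useful) pairs.
judgments = [(0, False), (1, True), (1, False), (2, True), (2, True), (0, False)]

def relevance_profile(judgments, grades=(0, 1, 2)):
    """Estimate P(user judges useful | document grade) for each grade level."""
    totals, useful = Counter(), Counter()
    for grade, is_useful in judgments:
        totals[grade] += 1
        useful[grade] += is_useful  # bool counts as 0/1
    return {g: useful[g] / totals[g] if totals[g] else None for g in grades}

print(relevance_profile(judgments))
# {0: 0.0, 1: 0.5, 2: 1.0} -- this user treats marginal documents as
# useful only half the time, consistent with folding them down.
```

A user whose estimated profile assigns low usefulness probability to marginal documents is one for whom batch results computed under the fold-down binarization should transfer well.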