Information retrieval effectiveness is usually evaluated with measures such as Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP), and Precision at a cutoff (Precision@k) computed over a set of judged queries. Recent research has suggested an alternative: evaluating retrieval systems based on observed user behavior. Particularly promising are experiments that interleave two rankings and track user clicks; a recent study found that interleaving experiments can identify large differences in retrieval effectiveness with much better reliability than other click-based methods. We study interleaving in more detail, comparing it with traditional measures in terms of reliability, sensitivity, and agreement. To detect very small differences in retrieval effectiveness, a reliable outcome with standard metrics requires about 5,000 judged queries, which is about as reliable as an interleaving experiment over 50,000 user impressions. Among the traditional measures, NDCG has the strongest correlation with interleaving. Finally, we present new forms of analysis, including an approach to enhance interleaving sensitivity.
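To make the two evaluation styles concrete, the sketch below (illustrative Python, not taken from the paper; all function and variable names are hypothetical) computes NDCG@k from graded relevance judgments and runs a simplified team-draft-style interleaving of two rankings, crediting clicks to the ranker that contributed each clicked result. It is a minimal sketch under these assumptions, not the exact procedure used in the study.

import math
import random

def ndcg_at_k(rels, k=10):
    """NDCG@k for graded relevance labels given in ranked order (offline evaluation)."""
    def dcg(gains):
        return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

def team_draft_interleave(ranking_a, ranking_b, length=10):
    """Simplified team-draft interleaving: at each slot, the ranker with fewer
    picks so far (ties broken by coin flip) contributes its highest-ranked
    document that has not yet been shown."""
    interleaved, teams, used = [], [], set()
    while len(interleaved) < length:
        a_remaining = [d for d in ranking_a if d not in used]
        b_remaining = [d for d in ranking_b if d not in used]
        if not a_remaining and not b_remaining:
            break
        a_picks, b_picks = teams.count('A'), teams.count('B')
        pick_a = bool(a_remaining) and (
            not b_remaining
            or a_picks < b_picks
            or (a_picks == b_picks and random.random() < 0.5)
        )
        team, pool = ('A', a_remaining) if pick_a else ('B', b_remaining)
        interleaved.append(pool[0])
        teams.append(team)
        used.add(pool[0])
    return interleaved, teams

def interleaving_outcome(teams, clicked_slots):
    """Credit each click to the ranker whose document filled the clicked slot;
    the ranker with more credited clicks wins this impression."""
    credit = {'A': 0, 'B': 0}
    for slot in clicked_slots:
        credit[teams[slot]] += 1
    if credit['A'] == credit['B']:
        return 'tie'
    return 'A' if credit['A'] > credit['B'] else 'B'

# Example: one simulated impression with clicks on the first and third results.
ranking_a = ['d1', 'd2', 'd3', 'd4']
ranking_b = ['d3', 'd1', 'd5', 'd6']
docs, teams = team_draft_interleave(ranking_a, ranking_b, length=4)
print(docs, teams, interleaving_outcome(teams, clicked_slots=[0, 2]))
print(ndcg_at_k([3, 2, 0, 1], k=4))  # offline metric on the same judged list

In an online experiment, the per-impression outcomes would be aggregated over many impressions (the abstract cites roughly 50,000) before declaring one ranker better, whereas the offline metric is averaged over the judged query set.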