Information retrieval effectiveness is usually evaluated with measures such as Normalized Discounted Cumulative Gain (NDCG), Mean Average Precision (MAP), and Precision at a cutoff (Precision@k) computed over a set of judged queries. Recent research has suggested an alternative: evaluating retrieval systems based on observed user behavior. Particularly promising are experiments that interleave two rankings and track user clicks; a recent study found that interleaving experiments can identify large differences in retrieval effectiveness with much better reliability than other click-based methods. We study interleaving in more detail, comparing it with traditional measures in terms of reliability, sensitivity, and agreement. To detect very small differences in retrieval effectiveness, a reliable outcome with standard metrics requires about 5,000 judged queries, which is about as reliable as an interleaving experiment over 50,000 user impressions. Among the traditional measures, NDCG has the strongest correlation with interleaving. Finally, we present new forms of analysis, including an approach to enhance interleaving sensitivity.
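To make the two evaluation styles concrete, the sketch below (illustrative Python, not taken from the paper; all function and variable names are hypothetical) computes NDCG@k from graded relevance judgments and runs a simplified team-draft-style interleaving of two rankings, crediting clicks to the ranker that contributed each clicked result. It is a minimal sketch under these assumptions, not the exact procedure used in the study.

import math
import random

def ndcg_at_k(rels, k=10):
    """NDCG@k for graded relevance labels given in ranked order (offline evaluation)."""
    def dcg(gains):
        return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

def team_draft_interleave(ranking_a, ranking_b, length=10):
    """Simplified team-draft interleaving: at each slot, the ranker with fewer
    picks so far (ties broken by coin flip) contributes its highest-ranked
    document that has not yet been shown."""
    interleaved, teams, used = [], [], set()
    while len(interleaved) < length:
        a_remaining = [d for d in ranking_a if d not in used]
        b_remaining = [d for d in ranking_b if d not in used]
        if not a_remaining and not b_remaining:
            break
        a_picks, b_picks = teams.count('A'), teams.count('B')
        pick_a = bool(a_remaining) and (
            not b_remaining
            or a_picks < b_picks
            or (a_picks == b_picks and random.random() < 0.5)
        )
        team, pool = ('A', a_remaining) if pick_a else ('B', b_remaining)
        interleaved.append(pool[0])
        teams.append(team)
        used.add(pool[0])
    return interleaved, teams

def interleaving_outcome(teams, clicked_slots):
    """Credit each click to the ranker whose document filled the clicked slot;
    the ranker with more credited clicks wins this impression."""
    credit = {'A': 0, 'B': 0}
    for slot in clicked_slots:
        credit[teams[slot]] += 1
    if credit['A'] == credit['B']:
        return 'tie'
    return 'A' if credit['A'] > credit['B'] else 'B'

# Example: one simulated impression with clicks on the first and third results.
ranking_a = ['d1', 'd2', 'd3', 'd4']
ranking_b = ['d3', 'd1', 'd5', 'd6']
docs, teams = team_draft_interleave(ranking_a, ranking_b, length=4)
print(docs, teams, interleaving_outcome(teams, clicked_slots=[0, 2]))
print(ndcg_at_k([3, 2, 0, 1], k=4))  # offline metric on the same judged list

In an online experiment, the per-impression outcomes would be aggregated over many impressions (the abstract cites roughly 50,000) before declaring one ranker better, whereas the offline metric is averaged over the judged query set.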