We provide a novel method of evaluating search results that combines existing editorial judgments with the relevance estimates generated by click-based user browsing models. Evaluation methods that use clicks and editorial judgments together exist in the literature, but our approach is novel in that it predicts the impact of unseen search models without online tests to collect clicks and without requesting new editorial data: we re-use only existing editorial judgments and clicks observed for previous result-set configurations. Since the user browsing model and the pre-existing editorial data cannot provide relevance estimates for every document retrieved for the selected queries, one important challenge is to estimate performance when many ranked documents have missing relevance values. We introduce query-based and position-based smoothing to overcome this problem, and show that a hybrid of the two performs better than either smoothing technique alone. Despite the high percentage of missing judgments, the resulting method is significantly correlated (0.74) with DCG values computed on fully judged datasets, approaching inter-annotator agreement. We also show that previously published techniques, which are applicable to frequent queries, degrade when applied to a random sample of queries, yielding a correlation of only 0.29. While our experiments focus on evaluation using DCG, our method is also applicable to other commonly used metrics.
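To make the evaluation concrete, below is a minimal sketch of how DCG might be computed over a ranked list in which missing judgments are filled by a hybrid of query-based and position-based smoothing. All names (smoothed_relevance, the alpha mixing weight, the example data) are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import math

def dcg(relevances):
    """Standard graded DCG: sum of (2^rel - 1) / log2(rank + 1), ranks from 1."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def smoothed_relevance(query, rank, judged, query_mean, position_mean, alpha=0.5):
    """Fill a missing judgment with a convex mix of the query's mean relevance
    and the mean relevance observed at this rank position (hybrid smoothing).
    `judged` holds known (query, rank) -> relevance values, whether from
    editorial labels or click-model estimates."""
    if (query, rank) in judged:
        return judged[(query, rank)]
    q = query_mean.get(query)
    p = position_mean.get(rank)
    if q is not None and p is not None:
        return alpha * q + (1 - alpha) * p   # hybrid of the two smoothers
    if q is not None:
        return q                             # fall back to query-based smoothing
    return p if p is not None else 0.0       # or position-based, else unjudged

# Example usage with toy data: means would in practice be estimated from all
# available editorial judgments and click-model relevance estimates.
judged = {("q1", 1): 3, ("q1", 3): 1, ("q2", 2): 2}
query_mean = {"q1": 2.0, "q2": 2.0}
position_mean = {1: 2.5, 2: 1.5, 3: 1.0, 4: 0.5, 5: 0.3}

for query in ("q1", "q2"):
    rels = [smoothed_relevance(query, r, judged, query_mean, position_mean)
            for r in range(1, 6)]
    print(query, round(dcg(rels), 3))
```

The alpha weight controls how much the fill-in value trusts the query's overall relevance versus the typical relevance at that rank; a hybrid of this form is one plausible way the query-based and position-based smoothers described above could be combined.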