Queries and click-through data taken from search engine transaction logs are an attractive alternative to traditional test collections, due to their volume and their direct relation to end-user querying. The overall aim of this paper is to answer the question: How does click-through data differ from explicit human relevance judgments in information retrieval evaluation? We compare a traditional test collection with manual judgments to transaction-log-based test collections, using queries as topics and the subsequent clicks as pseudo-relevance judgments for the clicked results. Specifically, we investigate two research questions. Firstly, are there significant differences between clicks and relevance judgments? Earlier research suggests that although clicks and explicit judgments show reasonable agreement, clicks differ from static, absolute relevance judgments. Secondly, are there significant differences between system rankings based on clicks and those based on relevance judgments? This is an open question, but earlier research suggests that comparative evaluation in terms of system ranking is remarkably robust.
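To make the setup concrete, the sketch below illustrates the general idea of treating queries as topics and clicked results as pseudo-relevance judgments, and then comparing the system ranking obtained from clicks with the one obtained from manual judgments. It is not the authors' actual pipeline; all data, field names, and the choice of precision@k and Kendall's tau as evaluation and rank-correlation measures are illustrative assumptions.

```python
# Minimal sketch: click log -> pseudo-qrels -> system ranking comparison.
# All inputs below are hypothetical; this is not the paper's implementation.
from collections import defaultdict
from itertools import combinations


def clicks_to_qrels(click_log):
    """Treat each distinct query as a topic and each clicked document as relevant."""
    qrels = defaultdict(set)
    for query, clicked_doc in click_log:
        qrels[query].add(clicked_doc)
    return qrels


def precision_at_k(ranked_docs, relevant, k):
    """Fraction of the top-k results that are judged (or pseudo-judged) relevant."""
    return sum(1 for d in ranked_docs[:k] if d in relevant) / k


def evaluate(system_runs, qrels, k):
    """Mean precision@k per system over the topics present in the judgments."""
    scores = {}
    for system, runs in system_runs.items():
        per_topic = [precision_at_k(runs[q], qrels[q], k) for q in qrels if q in runs]
        scores[system] = sum(per_topic) / len(per_topic) if per_topic else 0.0
    return scores


def kendall_tau(scores_a, scores_b):
    """Rank correlation between two system orderings (simple O(n^2) version)."""
    concordant = discordant = 0
    for s1, s2 in combinations(sorted(scores_a), 2):
        product = (scores_a[s1] - scores_a[s2]) * (scores_b[s1] - scores_b[s2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 0.0


if __name__ == "__main__":
    # Hypothetical click log: (query, clicked document id) pairs.
    click_log = [("q1", "d3"), ("q1", "d5"), ("q2", "d1")]
    # Hypothetical manual judgments for the same topics.
    manual_qrels = {"q1": {"d3", "d4"}, "q2": {"d1", "d2"}}
    # Hypothetical ranked result lists of two systems.
    system_runs = {
        "sysA": {"q1": ["d3", "d5", "d9"], "q2": ["d1", "d7", "d8"]},
        "sysB": {"q1": ["d9", "d4", "d3"], "q2": ["d7", "d1", "d2"]},
    }

    click_qrels = clicks_to_qrels(click_log)
    scores_clicks = evaluate(system_runs, click_qrels, k=3)
    scores_manual = evaluate(system_runs, manual_qrels, k=3)
    print("click-based scores:", scores_clicks)
    print("manual scores:     ", scores_manual)
    print("Kendall's tau between system rankings:",
          kendall_tau(scores_clicks, scores_manual))
```

A high rank correlation between the two orderings would indicate that click-based pseudo-judgments rank systems much like manual judgments do; a low correlation would indicate that the two kinds of evidence lead to different evaluation outcomes.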