Queries and click-through data taken from search engine transaction logs are an attractive alternative to traditional test collections, due to their volume and their direct relation to end-user querying. The overall aim of this paper is to answer the question: How does click-through data differ from explicit human relevance judgments in information retrieval evaluation? We compare a traditional test collection with manual judgments to transaction-log-based test collections, using queries as topics and the subsequent clicks as pseudo-relevance judgments for the clicked results. Specifically, we investigate two research questions. Firstly, are there significant differences between clicks and relevance judgments? Earlier research suggests that although clicks and explicit judgments show reasonable agreement, clicks differ from static, absolute relevance judgments. Secondly, are there significant differences between system rankings based on clicks and those based on relevance judgments? This is an open question, but earlier research suggests that comparative evaluation in terms of system ranking is remarkably robust.
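To make the setup concrete, the sketch below illustrates the general idea of treating queries as topics and clicked results as pseudo-relevance judgments, and then comparing the system ranking obtained from clicks with the one obtained from manual judgments. It is not the authors' actual pipeline; all data, field names, and the choice of precision@k and Kendall's tau as evaluation and rank-correlation measures are illustrative assumptions.

```python
# Minimal sketch: click log -> pseudo-qrels -> system ranking comparison.
# All inputs below are hypothetical; this is not the paper's implementation.
from collections import defaultdict
from itertools import combinations


def clicks_to_qrels(click_log):
    """Treat each distinct query as a topic and each clicked document as relevant."""
    qrels = defaultdict(set)
    for query, clicked_doc in click_log:
        qrels[query].add(clicked_doc)
    return qrels


def precision_at_k(ranked_docs, relevant, k):
    """Fraction of the top-k results that are judged (or pseudo-judged) relevant."""
    return sum(1 for d in ranked_docs[:k] if d in relevant) / k


def evaluate(system_runs, qrels, k):
    """Mean precision@k per system over the topics present in the judgments."""
    scores = {}
    for system, runs in system_runs.items():
        per_topic = [precision_at_k(runs[q], qrels[q], k) for q in qrels if q in runs]
        scores[system] = sum(per_topic) / len(per_topic) if per_topic else 0.0
    return scores


def kendall_tau(scores_a, scores_b):
    """Rank correlation between two system orderings (simple O(n^2) version)."""
    concordant = discordant = 0
    for s1, s2 in combinations(sorted(scores_a), 2):
        product = (scores_a[s1] - scores_a[s2]) * (scores_b[s1] - scores_b[s2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 0.0


if __name__ == "__main__":
    # Hypothetical click log: (query, clicked document id) pairs.
    click_log = [("q1", "d3"), ("q1", "d5"), ("q2", "d1")]
    # Hypothetical manual judgments for the same topics.
    manual_qrels = {"q1": {"d3", "d4"}, "q2": {"d1", "d2"}}
    # Hypothetical ranked result lists of two systems.
    system_runs = {
        "sysA": {"q1": ["d3", "d5", "d9"], "q2": ["d1", "d7", "d8"]},
        "sysB": {"q1": ["d9", "d4", "d3"], "q2": ["d7", "d1", "d2"]},
    }

    click_qrels = clicks_to_qrels(click_log)
    scores_clicks = evaluate(system_runs, click_qrels, k=3)
    scores_manual = evaluate(system_runs, manual_qrels, k=3)
    print("click-based scores:", scores_clicks)
    print("manual scores:     ", scores_manual)
    print("Kendall's tau between system rankings:",
          kendall_tau(scores_clicks, scores_manual))
```

A high rank correlation between the two orderings would indicate that click-based pseudo-judgments rank systems much like manual judgments do; a low correlation would indicate that the two kinds of evidence lead to different evaluation outcomes.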