Exploiting user disagreement for web search evaluation: an experimental approach

Authors:
Thomas Demeester;Robin Aly;Djoerd Hiemstra;Dong Nguyen;Dolf Trieschnigg;Chris Develder
Affiliations:
Ghent University - iMinds, Ghent, Belgium;University of Twente, Enschede, Netherlands;University of Twente, Enschede, Netherlands;University of Twente, Enschede, Netherlands;University of Twente, Enschede, Netherlands;Ghent University - iMinds, Ghent, Belgium
Venue:
Proceedings of the 7th ACM international conference on Web search and data mining
Year:
2014

Abstract

Graded relevance levels can be used in the evaluation of search results to express a more nuanced notion of relevance than binary judgments allow. Especially in Web search, users strongly prefer top results over less relevant ones, yet they often disagree on which results are the top results for a given information need. Whereas previous work has generally treated disagreement as a negative effect, this paper proposes a method that exploits user disagreement by integrating it into the evaluation procedure. First, we present experiments that investigate user disagreement. We argue that, when disagreement is high, lower relevance levels might need to be promoted more than when there is global consensus on the top results. We formalize this by introducing the User Disagreement Model, which yields a weighting of the relevance levels with a probabilistic interpretation. We give a validity analysis and explain how to integrate the model with well-established evaluation metrics. Finally, we discuss a specific application of the model: estimating suitable weights for the combined relevance of Web search snippets and pages.
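The abstract does not state the model's formula, so the following is only a rough sketch of one possible reading of "a weighting of the relevance levels with a probabilistic interpretation". It assumes that the weight of a level is the estimated probability that another user would label a result as top relevant, given the level a judge assigned, and that these weights replace the gain values in a DCG-style metric. The function names, the paired-label data, and the choice of DCG are all hypothetical illustrations, not the authors' definitions.

```python
import math
from collections import defaultdict


def disagreement_weights(paired_labels, top_level):
    """Estimate one weight per relevance level from paired judgments.

    paired_labels: iterable of (judge_level, other_user_level) tuples, where
    two people labelled the same result (hypothetical input format).
    The weight of a level is the fraction of pairs in which the other user
    chose top_level, given that the judge chose that level (an assumption).
    """
    totals = defaultdict(int)
    top_hits = defaultdict(int)
    for judge_level, user_level in paired_labels:
        totals[judge_level] += 1
        if user_level == top_level:
            top_hits[judge_level] += 1
    return {level: top_hits[level] / totals[level] for level in totals}


def dcg(ranked_levels, weights):
    """DCG with the estimated weights used as gains (log2 rank discount)."""
    return sum(
        weights.get(level, 0.0) / math.log2(rank + 1)
        for rank, level in enumerate(ranked_levels, start=1)
    )


# Made-up paired labels on a 0..3 scale, with 3 as the top level.
pairs = [
    (3, 3), (3, 3), (3, 2), (3, 3),
    (2, 3), (2, 2), (2, 1), (2, 2),
    (1, 1), (1, 0), (1, 1), (1, 1),
    (0, 0), (0, 0), (0, 1), (0, 0),
]
weights = disagreement_weights(pairs, top_level=3)
score = dcg([3, 2, 0], weights)
```

Under this reading, higher disagreement about the top level moves probability mass to the lower levels, which raises their weights relative to the top level. That matches the abstract's argument that lower levels might need to be promoted more when consensus is weak.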