Document-level relevance judgments are a major component in the calculation of effectiveness metrics, so collecting high-quality judgments is a critical step in information retrieval evaluation. However, the nature of, and the assumptions underlying, relevance judgment collection have received little attention. In particular, relevance judgments are typically collected for each document in isolation, even though users read each document in the context of other documents. In this work, we investigate the nature of relevance judgment collection. We collect relevance labels in both isolated and conditional settings, and we ask for judgments on several dimensions of relevance as well as overall relevance. We then compare effectiveness metrics computed from each type of judgment against other indicators of quality, such as user preference. Our analyses show how these judgment-collection settings affect the quality and characteristics of the resulting judgments. We also find that metrics based on conditional judgments correlate more strongly with user preference than metrics based on isolated judgments.
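The kind of comparison described above can be illustrated with a minimal sketch. The metric choice (DCG), the graded labels, and the stated user preference below are illustrative assumptions rather than the paper's actual measures or data: the sketch scores two result lists under isolated and conditional judgments and checks which judgment setting agrees with the user's side-by-side preference.

    import math

    def dcg(relevance_labels):
        """Discounted cumulative gain for a ranked list of graded labels."""
        return sum((2 ** rel - 1) / math.log2(rank + 2)
                   for rank, rel in enumerate(relevance_labels))

    # Hypothetical graded labels for the same two result lists (A and B),
    # judged once in isolation and once conditioned on surrounding documents.
    isolated    = {"A": [3, 2, 2, 0, 1], "B": [2, 3, 1, 1, 0]}
    conditional = {"A": [3, 1, 1, 0, 0], "B": [2, 3, 2, 1, 0]}

    # An illustrative user preference from a side-by-side comparison.
    user_prefers = "B"

    for setting, labels in (("isolated", isolated), ("conditional", conditional)):
        scores = {system: dcg(rels) for system, rels in labels.items()}
        metric_prefers = max(scores, key=scores.get)
        print(f"{setting:>11}: DCG={scores}, metric prefers {metric_prefers}, "
              f"agrees with user: {metric_prefers == user_prefers}")

In the study itself, agreement with user preference would be aggregated over many topic and result-list pairs to yield a correlation; this sketch shows only a single pair to make the comparison concrete.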