Information Processing and Management: an International Journal
The concept of relevance in IR
Journal of the American Society for Information Science and Technology
Relevance judgment: What do information users consider beyond topicality?
Journal of the American Society for Information Science and Technology - Research Articles
Evaluation by comparing result sets in context
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
The relationship between IR effectiveness measures and user satisfaction
SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Journal of the American Society for Information Science and Technology
Here or there: preference judgments for relevance
ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval
Do user preferences and evaluation measures line up?
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Comparing the sensitivity of information retrieval metrics
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Evaluating search systems using result page context
Proceedings of the third symposium on Information interaction in context
Using preference judgments for novel document retrieval
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Contextual and dimensional relevance judgments for reusable SERP-level evaluation
Proceedings of the 23rd international conference on World wide web
Hi-index | 0.00 |
Evaluation of information retrieval (IR) systems has recently been exploring the use of preference judgments over two search result lists. Unlike the traditional method of collecting relevance labels per single result, this method allows to consider the interaction between search results as part of the judging criteria. For example, one result list may be preferred over another if it has a more diverse set of relevant results, covering a wider range of user intents. In this paper, we investigate how assessors determine their preference for one list of results over another with the aim to understand the role of various relevance dimensions in preference-based evaluation. We run a series of experiments and collect preference judgments over different relevance dimensions in side-by-side comparisons of two search result lists, as well as relevance judgments for the individual documents. Our analysis of the collected judgments reveals that preference judgments combine multiple dimensions of relevance that go beyond the traditional notion of relevance centered on topicality. Measuring performance based on single document judgments and NDCG aligns well with topicality based preferences, but shows misalignment with judges' overall preferences, largely due to the diversity dimension. As a judging method, dimensional preference judging is found to lead to improved judgment quality.