Document-level relevance judgments are a major component in the calculation of effectiveness metrics, so collecting high-quality judgments is a critical step in information retrieval evaluation. However, the nature of, and the assumptions underlying, relevance judgment collection have received little attention. In particular, relevance judgments are typically collected for each document in isolation, even though users read each document in the context of other documents. In this work, we investigate the nature of relevance judgment collection. We collect relevance labels in both isolated and conditional settings, and we ask for judgments on several dimensions of relevance as well as overall relevance. We then compare effectiveness metrics computed from each type of judgment against other indicators of quality, such as user preference. Our analyses show how these judgment-collection settings affect the quality and characteristics of the resulting judgments. We also find that metrics based on conditional judgments correlate more strongly with user preference than metrics based on isolated judgments.
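The kind of comparison described above can be illustrated with a minimal sketch. The metric choice (DCG), the graded labels, and the stated user preference below are illustrative assumptions rather than the paper's actual measures or data: the sketch scores two result lists under isolated and conditional judgments and checks which judgment setting agrees with the user's side-by-side preference.

    import math

    def dcg(relevance_labels):
        """Discounted cumulative gain for a ranked list of graded labels."""
        return sum((2 ** rel - 1) / math.log2(rank + 2)
                   for rank, rel in enumerate(relevance_labels))

    # Hypothetical graded labels for the same two result lists (A and B),
    # judged once in isolation and once conditioned on surrounding documents.
    isolated    = {"A": [3, 2, 2, 0, 1], "B": [2, 3, 1, 1, 0]}
    conditional = {"A": [3, 1, 1, 0, 0], "B": [2, 3, 2, 1, 0]}

    # An illustrative user preference from a side-by-side comparison.
    user_prefers = "B"

    for setting, labels in (("isolated", isolated), ("conditional", conditional)):
        scores = {system: dcg(rels) for system, rels in labels.items()}
        metric_prefers = max(scores, key=scores.get)
        print(f"{setting:>11}: DCG={scores}, metric prefers {metric_prefers}, "
              f"agrees with user: {metric_prefers == user_prefers}")

In the study itself, agreement with user preference would be aggregated over many topic and result-list pairs to yield a correlation; this sketch shows only a single pair to make the comparison concrete.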