Recent efforts in test collection building have focused on scaling back the number of relevance judgments required per topic while scaling up the number of search topics. Since the largest source of variation in a Cranfield-style experiment comes from the topics, this is a reasonable approach. However, as topic set sizes grow and researchers turn to crowdsourcing platforms such as Amazon Mechanical Turk to collect relevance judgments, quality control becomes a concern. This paper examines the robustness of the TREC Million Query track methods when some assessors make significant and systematic errors. We find that while average effectiveness scores are robust, assessor errors can have a large effect on system rankings.
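To make the kind of robustness analysis described above concrete, the sketch below simulates systematic assessor error by flipping a fraction of relevance labels and then measures how much the induced system ranking changes, using Kendall's tau. This is only an illustrative outline, not the authors' actual procedure: the qrels representation, the `evaluate` scoring function, and the flip rate are all hypothetical placeholders.

```python
# Illustrative sketch (not the paper's code): perturb relevance judgments to
# mimic a systematically erring assessor, then compare the system ranking
# under original vs. perturbed judgments via Kendall's tau.
import random
from scipy.stats import kendalltau

def perturb_qrels(qrels, flip_rate=0.2, seed=0):
    """Flip a fraction of binary relevance labels to simulate assessor error.

    qrels: dict mapping (topic_id, doc_id) -> 0/1 relevance label (assumed format).
    """
    rng = random.Random(seed)
    return {key: (1 - rel) if rng.random() < flip_rate else rel
            for key, rel in qrels.items()}

def ranking_robustness(runs, qrels, evaluate, flip_rate=0.2):
    """Kendall's tau between system rankings before and after perturbation.

    runs: list of system runs; evaluate(run, qrels) -> effectiveness score
    (e.g., MAP) is assumed to be supplied by the caller.
    """
    noisy_qrels = perturb_qrels(qrels, flip_rate)
    original = [evaluate(run, qrels) for run in runs]
    perturbed = [evaluate(run, noisy_qrels) for run in runs]
    tau, _ = kendalltau(original, perturbed)
    return tau
```

A tau near 1.0 would indicate that the ranking of systems is stable despite the injected errors; lower values would signal the kind of ranking sensitivity the abstract reports, even when average scores remain largely unchanged.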