Relevance assessments are a key component of test collection-based evaluation of information retrieval systems. This paper reports on a feature of such collections that can be used as a form of ground truth data, allowing analysis of human assessment error. A wide range of test collections is retrospectively examined to determine how accurately assessors judge the relevance of documents. Our results demonstrate a high level of inconsistency across the collections studied. The level of irregularity varies across topics, with some showing a very high level of assessment error. We investigate possible influences on this error, and demonstrate that inconsistency in judging increases over time. While the level of detail in a topic specification does not appear to influence the errors that assessors make, judgements are significantly affected by decisions made on previously seen similar documents; assessors also display assessment inertia. Alternative approaches to generating relevance judgements appear to reduce errors. A further investigation of the way retrieval systems are ranked using sets of relevance judgements produced early and late in the judgement process reveals a consistent effect across the majority of the test collections examined. We conclude that there is clear value in examining, and even inserting, ground truth data in test collections, and we propose ways to help minimise the sources of inconsistency when creating future test collections.
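To make the kind of analysis described above concrete, the following is a minimal Python sketch of two measurements of this type: the rate at which known-identical document pairs receive conflicting relevance labels (a proxy for assessor error, reported per topic), and the agreement between system rankings induced by "early" and "late" sets of judgements, using Kendall's tau. The data structures and function names (inconsistency_by_topic, ranking_agreement, the qrels and score dictionaries) are hypothetical illustrations, not the authors' actual code or data.

# Hypothetical sketch, not the paper's implementation.
from collections import defaultdict
from scipy.stats import kendalltau

def inconsistency_by_topic(judgements, duplicate_pairs):
    # judgements: {(topic, doc): 0/1 relevance label}
    # duplicate_pairs: iterable of (topic, doc_a, doc_b) known-identical documents
    # Returns, per topic, the fraction of duplicate pairs given conflicting labels.
    conflicts, totals = defaultdict(int), defaultdict(int)
    for topic, doc_a, doc_b in duplicate_pairs:
        if (topic, doc_a) in judgements and (topic, doc_b) in judgements:
            totals[topic] += 1
            if judgements[(topic, doc_a)] != judgements[(topic, doc_b)]:
                conflicts[topic] += 1
    return {t: conflicts[t] / totals[t] for t in totals}

def ranking_agreement(scores_early, scores_late):
    # scores_early / scores_late: {system_id: effectiveness score}, each computed
    # from qrels built early or late in the judging process.
    # Returns Kendall's tau between the two induced system rankings.
    systems = sorted(set(scores_early) & set(scores_late))
    tau, _ = kendalltau([scores_early[s] for s in systems],
                        [scores_late[s] for s in systems])
    return tau

A high per-topic conflict rate would correspond to the "very high level of assessment error" noted for some topics, while a low tau between early- and late-judgement rankings would indicate that the timing of judging influences how systems are ranked.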