In many situations, humans judging document relevance are forced to trade off accuracy for speed. The development of better interactive retrieval systems and relevance assessment platforms requires measuring assessor accuracy, but to date the subjective nature of relevance has prevented such measurement. To quantify assessor performance, we define relevance to be a group's majority opinion, and demonstrate the value of this approach by comparing the performance of NIST assessors to a group of assessors representative of participants in many information retrieval user studies. Using data collected as part of a user study with 48 participants, we found that NIST assessors discriminate between relevant and non-relevant documents better than the average participant in our study, but that the NIST assessors' true positive rate is no better than that of the study participants. In addition, we found NIST assessors to be conservative in their judgments of relevance compared to the average participant.
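The following is a minimal sketch, not taken from the paper, of the evaluation idea the abstract describes: take the majority opinion of a group of assessors as the gold-standard relevance label, then score an individual assessor's true positive rate, false positive rate, and a signal-detection style discrimination measure against that standard. The function names, the example data, and the use of d' as the discrimination measure are assumptions made for illustration.

```python
# Hypothetical sketch: majority-vote gold standard plus per-assessor accuracy measures.
from statistics import NormalDist

def majority_label(votes):
    """Group's majority opinion: relevant (1) if more than half of the votes are relevant."""
    return 1 if sum(votes) > len(votes) / 2 else 0

def assessor_rates(judgments, gold):
    """True positive and false positive rates of one assessor against the gold labels."""
    tp = sum(1 for a, g in zip(judgments, gold) if a == 1 and g == 1)
    fp = sum(1 for a, g in zip(judgments, gold) if a == 1 and g == 0)
    pos = sum(gold)
    neg = len(gold) - pos
    return (tp / pos if pos else 0.0), (fp / neg if neg else 0.0)

def d_prime(tpr, fpr, eps=1e-3):
    """Signal-detection discrimination d' = z(TPR) - z(FPR), with rates clipped away from 0/1."""
    z = NormalDist().inv_cdf
    clip = lambda r: min(max(r, eps), 1 - eps)
    return z(clip(tpr)) - z(clip(fpr))

# Hypothetical data: each row is one document, columns are the group members' binary votes.
group_votes = [
    [1, 1, 0, 1],  # doc 1
    [0, 0, 1, 0],  # doc 2
    [1, 1, 1, 1],  # doc 3
    [0, 1, 0, 0],  # doc 4
]
gold = [majority_label(v) for v in group_votes]

# Hypothetical judgments by a single assessor (e.g., a NIST assessor) on the same documents.
assessor_judgments = [1, 0, 1, 0]
tpr, fpr = assessor_rates(assessor_judgments, gold)
print(f"TPR={tpr:.2f}  FPR={fpr:.2f}  d'={d_prime(tpr, fpr):.2f}")
```

Under this framing, an assessor can match the group's true positive rate while still discriminating better, by keeping the false positive rate low, which is one way to read the abstract's finding that NIST assessors are comparatively conservative.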