An analysis of crowd workers mistakes for specific and complex relevance assessment task

  • Authors:
  • Jesse Anderton, Maryam Bashir, Virgil Pavlu, Javed A. Aslam

  • Affiliations:
  • Northeastern University, Boston, Massachusetts, USA (all authors)

  • Venue:
  • Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM '13)
  • Year:
  • 2013

Abstract

The TREC 2012 Crowdsourcing track asked participants to crowdsource relevance assessments, with the goal of replicating costly expert judgments with relatively fast, inexpensive, but less reliable judgments from anonymous online workers. The track used 10 "ad hoc" queries that are highly specific and complex compared to typical web search queries. The crowdsourced assessments were evaluated against expert judgments made by highly trained and capable human analysts in 1999 as part of ad hoc track collection construction. Since most crowdsourcing approaches submitted to the TREC 2012 track produced assessment sets nowhere close to the expert judgments, we analyze the crowdsourcing mistakes made on this task using data we collected via Amazon's Mechanical Turk service. We investigate two types of crowdsourcing approaches: one that asks for a nominal relevance grade for each document, and another that asks for preferences between many (but not all) pairs of documents.
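
To make the two judgment formats mentioned in the abstract concrete, here is a minimal sketch of how such crowd data might be aggregated: majority vote over nominal grades, and a simple win-rate ranking over pairwise preferences. The document names, labels, and aggregation rules below are illustrative assumptions, not the methods used in the paper.

```python
from collections import Counter, defaultdict

# Illustrative sketch only: the data and aggregation rules (majority vote,
# win-rate ranking) are assumptions for illustration, not the paper's method.

# Format 1: nominal relevance grades, one label per worker per document.
grade_labels = {
    "doc1": ["relevant", "relevant", "not_relevant"],
    "doc2": ["not_relevant", "not_relevant", "relevant"],
}

def majority_grade(labels):
    """Pick the most frequent grade among workers for one document."""
    return Counter(labels).most_common(1)[0][0]

consensus = {doc: majority_grade(labels) for doc, labels in grade_labels.items()}
print(consensus)  # {'doc1': 'relevant', 'doc2': 'not_relevant'}

# Format 2: pairwise preferences over many (not all) document pairs.
# Each tuple (winner, loser) records one worker's stated preference.
preferences = [("doc1", "doc2"), ("doc1", "doc3"), ("doc3", "doc2")]

def win_rate_ranking(prefs):
    """Rank documents by the fraction of their comparisons they won."""
    wins, total = defaultdict(int), defaultdict(int)
    for winner, loser in prefs:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    return sorted(total, key=lambda d: wins[d] / total[d], reverse=True)

print(win_rate_ranking(preferences))  # ['doc1', 'doc3', 'doc2']
```

The pairwise format never asks workers for absolute grades, which is why only a relative ranking can be recovered from it in this sketch.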