Counting with the crowd

  • Authors:
  • Adam Marcus; David Karger; Samuel Madden; Robert Miller; Sewoong Oh

  • Affiliations:
  • MIT, CSAIL; MIT, CSAIL; MIT, CSAIL; MIT, CSAIL; MIT, CSAIL

  • Venue:
  • Proceedings of the VLDB Endowment
  • Year:
  • 2012


Abstract

In this paper, we address the problem of selectivity estimation in a crowdsourced database. Specifically, we develop several techniques for using workers on a crowdsourcing platform like Amazon's Mechanical Turk to estimate the fraction of items in a dataset (e.g., a collection of photos) that satisfy some property or predicate (e.g., photos of trees). We do this without explicitly iterating through every item in the dataset. This is important in crowdsourced query optimization to support predicate ordering, and in query evaluation when performing a GROUP BY operation with a COUNT or AVG aggregate. We compare sampling item labels, a traditional approach, to showing workers a collection of items and asking them to estimate how many satisfy some predicate. Additionally, we develop techniques to eliminate spammers and colluding attackers trying to skew selectivity estimates when using this count estimation approach. We find that for images, counting can be much more effective than sampled labeling, reducing the amount of work necessary to arrive at an estimate that is within 1% of the true fraction by up to an order of magnitude, with lower worker latency. We also find that sampled labeling outperforms count estimation on a text processing task, presumably because people are better at quickly processing large batches of images than they are at reading strings of text. Our spammer detection technique, which is applicable to both the label- and count-based approaches, can improve accuracy by up to two orders of magnitude.
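
The abstract contrasts two ways of estimating the fraction of items satisfying a predicate: sampling individual item labels versus asking each worker to estimate a count over a batch of items, with outlying worker responses filtered out. The Python sketch below illustrates that general idea only; the ground-truth fraction, the worker noise model, the batch sizes, and the median-based outlier filter are all simplified assumptions for illustration, not the estimation or spammer-detection algorithms from the paper.

```python
import random
import statistics

random.seed(0)

# Hypothetical dataset: 30% of items satisfy the predicate (assumption).
TRUE_FRACTION = 0.30
ITEMS = [1 if random.random() < TRUE_FRACTION else 0 for _ in range(10_000)]


def sampled_labeling_estimate(items, num_labels):
    """Traditional approach: workers label individual sampled items
    (1 = satisfies the predicate, 0 = does not); average the labels."""
    sample = random.sample(items, num_labels)
    return sum(sample) / len(sample)


def count_based_estimates(items, num_batches, batch_size, worker_noise=0.03):
    """Count-estimation approach: each worker sees a batch of items and
    reports roughly how many satisfy the predicate. Worker error is modeled
    here as small additive Gaussian noise (an assumption)."""
    fractions = []
    for _ in range(num_batches):
        batch = random.sample(items, batch_size)
        noisy_count = sum(batch) + random.gauss(0, worker_noise * batch_size)
        fractions.append(max(0.0, min(1.0, noisy_count / batch_size)))
    return fractions


def filter_outliers(fractions, k=2.0):
    """Crude stand-in for spammer detection: drop estimates more than k
    standard deviations from the median (not the paper's technique)."""
    med = statistics.median(fractions)
    sd = statistics.pstdev(fractions) or 1e-9
    return [f for f in fractions if abs(f - med) <= k * sd]


if __name__ == "__main__":
    label_est = sampled_labeling_estimate(ITEMS, num_labels=500)

    fractions = count_based_estimates(ITEMS, num_batches=25, batch_size=20)
    # Inject two "spammer" responses that always report 100%.
    fractions += [1.0, 1.0]
    raw_est = statistics.mean(fractions)
    cleaned_est = statistics.mean(filter_outliers(fractions))

    print(f"true fraction:          {TRUE_FRACTION:.3f}")
    print(f"sampled labeling (500): {label_est:.3f}")
    print(f"count-based, raw:       {raw_est:.3f}")
    print(f"count-based, filtered:  {cleaned_est:.3f}")
```

Running the sketch shows the raw count-based average being pulled upward by the injected spammer responses and recovering once the outlier filter removes them, mirroring (in highly simplified form) why the paper pairs count estimation with spammer detection.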