Leveraging transitive relations for crowdsourced joins

Authors:
Jiannan Wang;Guoliang Li;Tim Kraska;Michael J. Franklin;Jianhua Feng
Affiliations:
Department of Computer Science, Tsinghua University, Beijing, China;Department of Computer Science, Tsinghua University, Beijing, China;Brown University, Providence, USA;AMPLab, UC Berkeley, Berkeley, USA;Department of Computer Science, Tsinghua University, Beijing, China
Venue:
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Year:
2013



Abstract

The development of crowdsourced query processing systems has recently attracted significant attention in the database community. A variety of crowdsourced queries have been investigated. In this paper, we focus on the crowdsourced join query, which aims to utilize humans to find all pairs of matching objects from two collections. As a human-only solution is expensive, we adopt a hybrid human-machine approach that first uses machines to generate a candidate set of matching pairs and then asks humans to label the pairs in the candidate set as either matching or non-matching. Given the candidate pairs, existing approaches publish all pairs to a crowdsourcing platform for verification. However, they neglect the fact that the pairs satisfy transitive relations. For example, if o1 matches o2, and o2 matches o3, then we can deduce that o1 matches o3 without needing to crowdsource (o1, o3). To this end, we study how to leverage transitive relations for crowdsourced joins. We propose a hybrid transitive-relations and crowdsourcing labeling framework that aims to crowdsource the minimum number of pairs needed to label all the candidate pairs. We prove the optimal labeling order and devise a parallel labeling algorithm to efficiently crowdsource the pairs following that order. We evaluate our approaches both in a simulated environment and on a real crowdsourcing platform. Experimental results show that our approaches with transitive relations save much more money and time than existing methods, with only a small loss in result quality.
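
The deduction described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it is a minimal, sequential illustration that assumes a union-find structure for tracking matched objects, and it adds one rule the abstract does not state (if o1 matches o2 and o2 does not match o3, then o1 does not match o3). The names `TransitiveLabeler`, `label_pairs`, and `ask_crowd` are hypothetical and exist only for this example.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Pair = Tuple[str, str]


class TransitiveLabeler:
    """Stores crowd answers and deduces labels implied by transitivity."""

    def __init__(self) -> None:
        # Union-find parent pointers; each cluster holds mutually matching objects.
        self.parent: Dict[str, str] = {}
        # Cluster root -> roots of clusters known not to match it.
        # ASSUMPTION: this non-match rule is not stated in the abstract.
        self.non_match: Dict[str, Set[str]] = {}

    def find(self, x: str) -> str:
        """Return the root of x's cluster, registering x if it is new."""
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def deduce(self, a: str, b: str) -> Optional[bool]:
        """Return True/False if the label is implied, or None if unknown."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True
        if rb in self.non_match.get(ra, set()):
            return False
        return None

    def record(self, a: str, b: str, is_match: bool) -> None:
        """Store one crowd answer for the pair (a, b)."""
        # Simplification: ignore answers that are already implied or that
        # contradict earlier answers.
        if self.deduce(a, b) is not None:
            return
        ra, rb = self.find(a), self.find(b)
        if is_match:
            # Merge rb's cluster into ra's and carry over its non-match links.
            self.parent[rb] = ra
            for other in self.non_match.pop(rb, set()):
                self.non_match[other].discard(rb)
                self.non_match[other].add(ra)
                self.non_match.setdefault(ra, set()).add(other)
        else:
            self.non_match.setdefault(ra, set()).add(rb)
            self.non_match.setdefault(rb, set()).add(ra)


def label_pairs(
    pairs: List[Pair], ask_crowd: Callable[[str, str], bool]
) -> Tuple[Dict[Pair, bool], int]:
    """Label pairs in the given order, asking the crowd only when needed.

    Returns the labels and the number of pairs that were crowdsourced.
    """
    labeler = TransitiveLabeler()
    labels: Dict[Pair, bool] = {}
    asked = 0
    for a, b in pairs:
        label = labeler.deduce(a, b)
        if label is None:
            label = ask_crowd(a, b)  # hypothetical stand-in for a crowd task
            asked += 1
            labeler.record(a, b, label)
        labels[(a, b)] = label
    return labels, asked
```

With candidate pairs (o1, o2), (o2, o3), (o1, o3) processed in that order, only the first two would be sent to the crowd if both come back as matches; the third follows by transitivity. The sketch processes pairs in whatever order it is given and one at a time, so it does not reproduce the optimal labeling order or the parallel labeling algorithm that the abstract refers to, and it assumes the crowd's answers are correct.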