We study how to enhance Entity Resolution (ER) with the help of crowdsourcing. ER is the problem of clustering records that refer to the same real-world entity and can be extremely difficult for computer algorithms alone. For example, deciding which images refer to the same person can be hard for computers but easy for humans. In our setting, we ask humans questions in order to guide ER toward accurate results. Since human work is costly, our goal is to ask as few questions as possible. We propose a probabilistic framework for ER that estimates how much ER accuracy we gain by asking each candidate question and selects the question with the highest expected accuracy. Computing the expected accuracy exactly is #P-hard, so we propose approximation techniques for efficient computation. We evaluate our best-question algorithms on real and synthetic datasets and demonstrate that they achieve high ER accuracy while significantly reducing the number of questions asked of humans.
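To make the greedy "best question" idea concrete, here is a minimal sketch, not the paper's actual framework, that assumes a simplified model in which each candidate record pair matches independently with some probability and accuracy is the expected fraction of correctly labeled pairs. The names (`match_prob`, `expected_accuracy`, `best_question`) are illustrative only; in the paper's setting the pairs are correlated (e.g., through transitivity over clusterings), which is what makes the exact computation #P-hard and motivates approximation.

```python
def expected_accuracy(match_prob, answered):
    """Expected pairwise accuracy under a toy independent-pair model:
    answered pairs are certain (accuracy 1); unanswered pairs are predicted
    by thresholding, so the chance of being right is max(p, 1 - p)."""
    total = 0.0
    for pair, p in match_prob.items():
        total += 1.0 if pair in answered else max(p, 1.0 - p)
    return total / len(match_prob)

def best_question(match_prob, answered):
    """Greedily pick the unanswered pair whose answer would raise expected
    accuracy the most. Under this toy model that is simply the most
    uncertain pair, but the loop mirrors the general strategy of scoring
    every candidate question before asking the crowd."""
    best, best_gain = None, -1.0
    base = expected_accuracy(match_prob, answered)
    for pair in match_prob:
        if pair in answered:
            continue
        # Answering removes the pair's uncertainty whether the crowd says
        # "yes" or "no", so the gain is the same for either outcome here.
        gain = expected_accuracy(match_prob, answered | {pair}) - base
        if gain > best_gain:
            best, best_gain = pair, gain
    return best

if __name__ == "__main__":
    # Toy matcher output: pairwise match probabilities between records.
    match_prob = {("r1", "r2"): 0.55, ("r1", "r3"): 0.90,
                  ("r2", "r3"): 0.50, ("r3", "r4"): 0.10,
                  ("r1", "r4"): 0.20, ("r2", "r4"): 0.45}
    answered = set()
    for _ in range(3):  # budget of three crowd questions
        q = best_question(match_prob, answered)
        print("ask the crowd about", q)
        answered.add(q)  # pretend the crowd answered
```

In this simplified setting the greedy rule reduces to asking about the most uncertain pair first; the paper's contribution is doing the analogous expected-accuracy computation over full clusterings, where exact evaluation is intractable and must be approximated.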