We study how to enhance Entity Resolution (ER) with the help of crowdsourcing. ER is the problem of clustering records that refer to the same real-world entity and can be extremely difficult for computer algorithms alone. For example, deciding which images refer to the same person can be hard for computers but easy for humans. In our setting, we ask humans questions in order to guide ER toward accurate results. Since human work is costly, our goal is to ask as few questions as possible. We propose a probabilistic framework for ER that estimates how much ER accuracy we gain by asking each candidate question and selects the question with the highest expected accuracy. Computing the expected accuracy exactly is #P-hard, so we propose approximation techniques for efficient computation. We evaluate our best-question algorithms on real and synthetic datasets and demonstrate that they achieve high ER accuracy while significantly reducing the number of questions asked of humans.
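To make the greedy "best question" idea concrete, here is a minimal sketch, not the paper's actual framework, that assumes a simplified model in which each candidate record pair matches independently with some probability and accuracy is the expected fraction of correctly labeled pairs. The names (`match_prob`, `expected_accuracy`, `best_question`) are illustrative only; in the paper's setting the pairs are correlated (e.g., through transitivity over clusterings), which is what makes the exact computation #P-hard and motivates approximation.

```python
def expected_accuracy(match_prob, answered):
    """Expected pairwise accuracy under a toy independent-pair model:
    answered pairs are certain (accuracy 1); unanswered pairs are predicted
    by thresholding, so the chance of being right is max(p, 1 - p)."""
    total = 0.0
    for pair, p in match_prob.items():
        total += 1.0 if pair in answered else max(p, 1.0 - p)
    return total / len(match_prob)

def best_question(match_prob, answered):
    """Greedily pick the unanswered pair whose answer would raise expected
    accuracy the most. Under this toy model that is simply the most
    uncertain pair, but the loop mirrors the general strategy of scoring
    every candidate question before asking the crowd."""
    best, best_gain = None, -1.0
    base = expected_accuracy(match_prob, answered)
    for pair in match_prob:
        if pair in answered:
            continue
        # Answering removes the pair's uncertainty whether the crowd says
        # "yes" or "no", so the gain is the same for either outcome here.
        gain = expected_accuracy(match_prob, answered | {pair}) - base
        if gain > best_gain:
            best, best_gain = pair, gain
    return best

if __name__ == "__main__":
    # Toy matcher output: pairwise match probabilities between records.
    match_prob = {("r1", "r2"): 0.55, ("r1", "r3"): 0.90,
                  ("r2", "r3"): 0.50, ("r3", "r4"): 0.10,
                  ("r1", "r4"): 0.20, ("r2", "r4"): 0.45}
    answered = set()
    for _ in range(3):  # budget of three crowd questions
        q = best_question(match_prob, answered)
        print("ask the crowd about", q)
        answered.add(q)  # pretend the crowd answered
```

In this simplified setting the greedy rule reduces to asking about the most uncertain pair first; the paper's contribution is doing the analogous expected-accuracy computation over full clusterings, where exact evaluation is intractable and must be approximated.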