CrowdER: crowdsourcing entity resolution

Authors:
Jiannan Wang; Tim Kraska; Michael J. Franklin; Jianhua Feng
Affiliations:
AMPLab, UC Berkeley and Tsinghua University; AMPLab, UC Berkeley; AMPLab, UC Berkeley; Tsinghua University
Venue:
Proceedings of the VLDB Endowment
Year:
2012

Abstract

Entity resolution is central to data integration and data cleaning. Algorithmic approaches have been improving in quality, but remain far from perfect. Crowdsourcing platforms offer a more accurate but expensive (and slow) way to bring human insight into the process. Previous work has proposed batching verification tasks for presentation to human workers, but even with batching, a human-only approach is infeasible for data sets of even moderate size because of the large number of matches to be tested. Instead, we propose a hybrid human-machine approach in which machines do an initial, coarse pass over all the data, and people verify only the most likely matching pairs. We show that for such a hybrid system, generating the minimum number of verification tasks of a given size is NP-hard, so we develop a novel two-tiered heuristic approach for creating batched tasks. We describe this method and present the results of extensive experiments on real data sets using a popular crowdsourcing platform. The experiments show that our hybrid approach achieves both good efficiency and high accuracy compared to machine-only or human-only alternatives.
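
The hybrid idea in the abstract can be illustrated with a minimal sketch. It is not the paper's two-tiered heuristic: the token-based Jaccard similarity, the threshold value, the sample records, and the naive fixed-size chunking are all assumptions chosen for illustration, and the function names are hypothetical. The machine pass discards unlikely pairs, and only the surviving pairs are grouped into verification tasks for human workers.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between the lowercase token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def candidate_pairs(records, threshold):
    """Machine pass: keep index pairs whose similarity reaches the threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(records[i], records[j]) >= threshold
    ]


def batch_pairs(pairs, batch_size):
    """Naive batching (an assumption): chunk pairs into fixed-size tasks."""
    return [pairs[k:k + batch_size] for k in range(0, len(pairs), batch_size)]


# Hypothetical product records, made up for this sketch.
records = [
    "iPad 2 16GB WiFi White",
    "iPad 2nd generation 16GB WiFi White",
    "iPhone 4 16GB White",
    "Apple iPod shuffle 2GB Blue",
]

all_pairs = list(combinations(range(len(records)), 2))
pairs = candidate_pairs(records, threshold=0.25)
tasks = batch_pairs(pairs, batch_size=2)

print(f"{len(all_pairs)} possible pairs, {len(pairs)} sent to the crowd")
for n, task in enumerate(tasks, start=1):
    print(f"task {n}: {task}")
```

In this toy run the machine pass cuts six possible pairs down to three, which are then packed into two tasks. Chunking pairs in arrival order makes no attempt to minimize the number of tasks, which is the problem the abstract describes as NP-hard.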