Detecting duplicate entities, usually by examining metadata, has been the focus of much recent work. Existing methods identify duplicate entities while optimizing either for accuracy or for efficiency and speed, and no perfect solution exists. We propose a combined, layered approach to duplicate detection whose main advantage is the use of crowdsourcing as a training and feedback mechanism. By applying active learning techniques to human-provided examples, we fine-tune our algorithm toward higher duplicate detection accuracy. We keep training costs low by gathering training data on demand, only for borderline cases or inconclusive assessments. We apply our simple yet powerful methods to an online publication search system: first, we perform coarse duplicate detection in real time, relying on publication signatures; then, a second, automatic step compares duplicate candidates and increases accuracy, adjusting based on feedback from both our online users and crowdsourcing platforms. Our approach improves accuracy by 14% over the untrained setting and comes within 4% of the human assessors.
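The two-step pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact form of the publication signature, the similarity scores, and the routing thresholds (`low`, `high`) are all assumptions made for the example. Step 1 groups publications by a coarse signature; step 2 routes confidently scored candidate pairs to an automatic decision and sends only borderline scores to human assessors, whose labels would then feed the active learning loop.

```python
import re
from collections import defaultdict

def signature(title, year):
    # Illustrative signature (assumed form, not specified by the paper):
    # lowercased alphanumeric title tokens joined together, plus the year.
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return (" ".join(tokens), year)

def coarse_candidates(publications):
    # Step 1: group records by signature; any group with more than one
    # record becomes a duplicate-candidate set for the finer second step.
    groups = defaultdict(list)
    for pub in publications:
        groups[signature(pub["title"], pub["year"])].append(pub)
    return [group for group in groups.values() if len(group) > 1]

def route(score, low=0.35, high=0.75):
    # Step 2 (sketch): confident similarity scores are decided automatically;
    # borderline scores are deferred to crowdsourcing on demand, keeping
    # labeling cost low. The threshold values here are hypothetical.
    if score >= high:
        return "duplicate"
    if score <= low:
        return "distinct"
    return "ask-crowd"

pubs = [
    {"title": "Swoosh: a generic approach to entity resolution", "year": 2009},
    {"title": "Swoosh: A Generic Approach to Entity Resolution.", "year": 2009},
    {"title": "Duplicate Record Detection: A Survey", "year": 2007},
]
print(len(coarse_candidates(pubs)))  # 1: the two "Swoosh" records form one group
print(route(0.5))                    # ask-crowd
```

Only candidate groups surviving step 1 ever reach the pairwise comparison, which is what keeps the coarse pass cheap enough to run in real time.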