We present a declarative framework for collective deduplication of entity references in the presence of constraints. Constraints occur naturally in many data cleaning domains and can improve the quality of deduplication. An example of a constraint is "each paper has a unique publication venue"; if two paper references are duplicates, then their associated conference references must be duplicates as well. Our framework supports collective deduplication, meaning that in the example above we can deduplicate both paper references and conference references collectively. The framework is based on a simple declarative Datalog-style language with precise semantics. Most previous work on deduplication either ignores constraints or uses them in an ad hoc, domain-specific manner. We also present efficient algorithms to support the framework; these algorithms have precise theoretical guarantees for a large subclass of the framework. Using a prototype implementation, we show that our algorithms scale to very large datasets, and we provide thorough experimental results over real-world data demonstrating the utility of our framework for high-quality and scalable deduplication.
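The interaction described in the abstract — a soft matching rule on papers whose merges are propagated to venues by a hard constraint — can be illustrated with a minimal sketch. This is an assumption-laden toy (union-find clustering plus a hypothetical Jaccard title similarity), not the paper's actual Datalog-based system; the data and the 0.5 threshold are invented for illustration.

```python
# Toy sketch of collective deduplication with a hard constraint:
# "each paper has a unique venue" means that merging two paper
# references forces their venue references to merge as well.
# NOT the paper's algorithm -- a hypothetical illustration only.

class UnionFind:
    """Disjoint-set clusters for entity references."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:                    # path halving
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def similar(t1, t2):
    """Hypothetical soft rule: Jaccard token overlap above 0.5."""
    a, b = set(t1.lower().split()), set(t2.lower().split())
    return len(a & b) / len(a | b) > 0.5

# paper reference id -> (title string, venue reference id)
papers = {
    "p1": ("Collective Deduplication of Entity References", "v1"),
    "p2": ("Collective Deduplication of Entity Refs", "v2"),
    "p3": ("Query Optimization Survey", "v3"),
}

paper_uf, venue_uf = UnionFind(), UnionFind()
ids = list(papers)
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        p, q = ids[i], ids[j]
        if similar(papers[p][0], papers[q][0]):
            paper_uf.union(p, q)                        # soft rule fires
            venue_uf.union(papers[p][1], papers[q][1])  # hard constraint propagates

# p1 and p2 merge, which collectively merges venues v1 and v2;
# p3 (and v3) stay in their own clusters.
```

The point of the sketch is the propagation step: the venue merge is not triggered by any venue-level similarity, only by the constraint tying venues to their papers — which is what "collective" deduplication means here.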