NADEEF: a commodity data cleaning system

Authors:
Michele Dallachiesa;Amr Ebaid;Ahmed Eldawy;Ahmed Elmagarmid;Ihab F. Ilyas;Mourad Ouzzani;Nan Tang
Affiliations:
University of Trento, Trento, Italy;Purdue University, Lafayette, USA;University of Minnesota, Minneapolis, USA;QCRI, Doha, Qatar;QCRI, Doha, Qatar;QCRI, Doha, Qatar;QCRI, Doha, Qatar
Venue:
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Year:
2013

Abstract

Despite the increasing importance of data quality and the rich theoretical and practical contributions in all aspects of data cleaning, there is no single end-to-end, off-the-shelf solution to (semi-)automate the detection and repair of violations w.r.t. a set of heterogeneous and ad hoc quality constraints. In short, there is no commodity platform similar to general-purpose DBMSs that can be easily customized and deployed to solve application-specific data quality problems. In this paper, we present NADEEF, an extensible, generalized, and easy-to-deploy data cleaning platform. NADEEF distinguishes between a programming interface and a core to achieve generality and extensibility. The programming interface allows users to specify multiple types of data quality rules, which uniformly define what is wrong with the data and (possibly) how to repair it, by writing code that implements predefined classes. We show that the programming interface can be used to express many types of data quality rules beyond the well-known CFDs (FDs), MDs, and ETL rules. Treating user-implemented interfaces as black boxes, the core provides algorithms to detect errors and to clean data. The core is designed so that cleaning algorithms can cope with multiple rules holistically, i.e., detecting and repairing data errors without differentiating between various types of rules. We showcase two implementations of core repairing algorithms. These two implementations demonstrate the extensibility of our core, which can also be replaced by other user-provided algorithms. Using real-life data, we experimentally verify the generality, extensibility, and effectiveness of our system.
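
The split between a programming interface and a core can be illustrated with a small sketch. It is not NADEEF's actual API: the names Rule, Violation, Fix, FunctionalDependency, NotEmpty, detect_all and clean are all hypothetical. The abstract does not describe the two repair algorithms, so a simple majority vote over candidate fixes stands in for them here. The sketch shows two things: each rule says what is wrong and, optionally, how to repair it, and the core calls every rule the same way without knowing its type.

```python
from abc import ABC, abstractmethod
from collections import Counter
from dataclasses import dataclass
from itertools import combinations

# All names below are hypothetical; this is not NADEEF's real interface.


@dataclass(frozen=True)
class Violation:
    rule: str     # name of the rule that was broken
    cells: tuple  # ((row_index, column), ...) involved in the violation


@dataclass(frozen=True)
class Fix:
    row: int
    column: str
    value: str    # candidate replacement value


class Rule(ABC):
    """What is wrong (detect) and, optionally, how to repair it (repair)."""
    name = "rule"

    @abstractmethod
    def detect(self, table):
        """Return a list of Violation objects for the given table."""

    def repair(self, table, violation):
        """Return candidate Fix objects; detection-only rules return none."""
        return []


class FunctionalDependency(Rule):
    """Rows that agree on `lhs` must agree on `rhs`."""

    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs
        self.name = f"FD({lhs} -> {rhs})"

    def detect(self, table):
        return [
            Violation(self.name, ((i, self.rhs), (j, self.rhs)))
            for (i, r), (j, s) in combinations(enumerate(table), 2)
            if r[self.lhs] == s[self.lhs] and r[self.rhs] != s[self.rhs]
        ]

    def repair(self, table, violation):
        (i, col), (j, _) = violation.cells
        # Either cell could take the other's value; the core decides.
        return [Fix(i, col, table[j][col]), Fix(j, col, table[i][col])]


class NotEmpty(Rule):
    """Detection-only rule: it flags empty cells but proposes no fix."""

    def __init__(self, column):
        self.column = column
        self.name = f"NotEmpty({column})"

    def detect(self, table):
        return [Violation(self.name, ((i, self.column),))
                for i, row in enumerate(table) if not row[self.column]]


def detect_all(table, rules):
    """Core detection: every rule is called as a black box."""
    return [v for rule in rules for v in rule.detect(table)]


def clean(table, rules, max_rounds=100):
    """Stand-in core repair: apply the most-voted fix, then re-detect."""
    table = [dict(row) for row in table]  # work on a copy
    for _ in range(max_rounds):
        votes = Counter()
        for rule in rules:
            for violation in rule.detect(table):
                votes.update(rule.repair(table, violation))
        if not votes:
            break  # no rule proposes any further fix
        fix = votes.most_common(1)[0][0]
        table[fix.row][fix.column] = fix.value
    return table, detect_all(table, rules)


dirty = [
    {"zip": "10001", "city": "NYC"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "10001", "city": "NY"},
    {"zip": "60601", "city": ""},
]
cleaned, remaining = clean(dirty, [FunctionalDependency("zip", "city"),
                                   NotEmpty("city")])
```

In this toy run the third row's city is changed to "NYC" because that fix collects the most votes, while the empty city in the last row is only reported, since NotEmpty proposes no repair. Swapping clean for a different function with the same inputs would leave the rule classes untouched, which is the kind of extensibility the abstract attributes to the core.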