NADEEF: a generalized data cleaning system
Proceedings of the VLDB Endowment
Despite the growing importance of data quality and the rich theoretical and practical contributions in all aspects of data cleaning, there is no single end-to-end, off-the-shelf solution to (semi-)automate the detection and repair of violations with respect to a set of heterogeneous and ad hoc quality constraints. In short, there is no commodity platform, comparable to a general-purpose DBMS, that can be easily customized and deployed to solve application-specific data quality problems. In this paper, we present NADEEF, an extensible, generalized, and easy-to-deploy data cleaning platform. NADEEF separates a programming interface from a core to achieve generality and extensibility. The programming interface lets users specify multiple types of data quality rules by writing code that implements predefined classes; each rule uniformly defines what is wrong with the data and (possibly) how to repair it. We show that the programming interface can express many types of data quality rules beyond the well-known conditional functional dependencies (CFDs, which generalize FDs), matching dependencies (MDs), and ETL rules. Treating the user-implemented interfaces as black boxes, the core provides algorithms to detect errors and to clean data. The core is designed so that cleaning algorithms can cope with multiple rules holistically, i.e., detect and repair data errors without differentiating between types of rules. We showcase two implementations of the core repairing algorithms. These implementations demonstrate the extensibility of the core, which can also be replaced by other user-provided algorithms. Using real-life data, we experimentally verify the generality, extensibility, and effectiveness of our system.
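The split between a rule-facing programming interface and a rule-agnostic core can be illustrated with a minimal Python sketch. It is not NADEEF's actual API: the names `Rule`, `Violation`, `Fix`, `detect_all`, and `clean` are hypothetical, and the naive repair policy (copy the value of the first conflicting tuple, then re-detect) is only a stand-in for the core repairing algorithms mentioned in the abstract.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Violation:
    rule: str      # name of the rule that was violated
    cells: tuple   # ((row_index, column), ...) involved in the violation


@dataclass(frozen=True)
class Fix:
    row: int
    column: str
    value: object


class Rule(ABC):
    """Hypothetical rule interface: what is wrong, and possibly how to fix it."""
    name = "rule"

    @abstractmethod
    def detect(self, rows):
        """Return a list of Violation objects found in rows."""

    def repair(self, rows, violation):
        """Return candidate fixes; the default is detect-only (no fixes)."""
        return []


class FunctionalDependency(Rule):
    """lhs -> rhs: tuples that agree on lhs must agree on rhs."""

    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs
        self.name = f"FD {lhs} -> {rhs}"

    def detect(self, rows):
        found = []
        for (i, a), (j, b) in combinations(enumerate(rows), 2):
            if a[self.lhs] == b[self.lhs] and a[self.rhs] != b[self.rhs]:
                found.append(Violation(self.name, ((i, self.rhs), (j, self.rhs))))
        return found

    def repair(self, rows, violation):
        # Assumed policy for illustration: trust the first tuple's value.
        (i, column), (j, _) = violation.cells
        return [Fix(j, column, rows[i][column])]


class NotNull(Rule):
    """A detect-only rule: it flags missing values but proposes no fix."""

    def __init__(self, column):
        self.column = column
        self.name = f"NOT NULL {column}"

    def detect(self, rows):
        return [Violation(self.name, ((i, self.column),))
                for i, row in enumerate(rows) if row[self.column] is None]


def detect_all(rules, rows):
    """Core side: run every rule as a black box, ignoring its type."""
    return [(rule, v) for rule in rules for v in rule.detect(rows)]


def clean(rules, rows, max_rounds=10):
    """Apply proposed fixes round by round on a copy of rows.

    Returns the cleaned copy and the violations that remain.
    """
    rows = [dict(row) for row in rows]
    for _ in range(max_rounds):
        fixes = [f for rule, v in detect_all(rules, rows)
                 for f in rule.repair(rows, v)]
        if not fixes:
            break
        for f in fixes:
            rows[f.row][f.column] = f.value
    return rows, detect_all(rules, rows)


dirty = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "60601", "city": None},
]
rules = [FunctionalDependency("zip", "city"), NotNull("city")]
cleaned, remaining = clean(rules, dirty)
print(cleaned)
print([v.rule for _, v in remaining])
```

In this sketch, `detect_all` and `clean` call only `detect` and `repair`, so a new kind of rule can be added without changing them, and `clean` could be swapped for a different repair strategy without touching any rule.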