GDR: a system for guided data repair

Authors:
Mohamed Yakout;Ahmed K. Elmagarmid;Jennifer Neville;Mourad Ouzzani
Affiliations:
Purdue University, West Lafayette, IN, USA;Purdue University, West Lafayette, IN, USA;Purdue University, West Lafayette, IN, USA;Purdue University, West Lafayette, IN, USA
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Citing 10
Cited 4

AJAX: an extensible data cleaning tool

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Random Forests

Machine Learning
Potter's Wheel: An Interactive Data Cleaning System

Proceedings of the 27th International Conference on Very Large Data Bases
A cost-based model and effective heuristic for repairing constraints by value modification

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications)

Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications)
ALIAS: an active learning led interactive deduplication system

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Improving data quality: consistency and accuracy

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Conditional functional dependencies for capturing data inconsistencies

ACM Transactions on Database Systems (TODS)
Semandaq: a data quality system based on conditional functional dependencies

Proceedings of the VLDB Endowment
Estimating the confidence of conditional functional dependencies

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

Guided data repair

Proceedings of the VLDB Endowment
Interaction between record matching and data repairing

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Statistical distortion: consequences of data cleaning

Proceedings of the VLDB Endowment
The data analytics group at the qatar computing research institute

ACM SIGMOD Record

Quantified Score

Hi-index	0.00

Visualization

Abstract

Improving data quality is a time-consuming, labor-intensive and often domain specific operation. Existing data repair approaches are either fully automated or not efficient in interactively involving the users. We present a demo of GDR, a Guided Data Repair system that uses a novel approach to efficiently involve the user alongside automatic data repair techniques to reach better data quality as quickly as possible. Specifically, GDR generates data repairs and acquire feedback on them that would be most beneficial in improving the data quality. GDR quantifies the data quality benefit of generated repairs by combining mechanisms from decision theory and active learning. Based on these benefit scores, groups of repairs are ranked and displayed to the user. User feedback is used to train a machine learning component to eventually replace the user in deciding on the validity of a suggested repair. We describe how the generated repairs are ranked and displayed to the user in a "useful-looking" way and demonstrate how data quality can be effectively improved with minimal feedback from the user.