Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes

Authors:
Mohamed Yakout;Laure Berti-Équille;Ahmed K. Elmagarmid
Affiliations:
Microsoft Corp., Bellevue, WA, USA;Institut de Recherche pour le Développement, Aix-en-Provence, France;Qatar Computing Research Institute, Doha, Qatar
Venue:
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Year:
2013

Abstract

Various computational procedures and constraint-based methods for data repairing have been proposed over the last decades to identify errors and, when possible, correct them. However, these approaches have several limitations, including their scalability and the quality of the values used to replace the errors. In this paper, we propose a new data repairing approach based on maximizing the likelihood of the replacement data given the data distribution, which can be modeled using statistical machine learning techniques. This is a novel approach that combines machine learning and likelihood methods to clean dirty databases by value modification. We develop a quality measure for the repairing updates based on the likelihood benefit and the amount of change applied to the database. We propose SCARE (SCalable Automatic REpairing), a systematic, scalable framework that follows our approach. SCARE relies on a robust mechanism for horizontal data partitioning and a combination of machine learning techniques to predict the set of possible updates. Because of the data partitioning, several updates can be predicted for a single record, each based on the local view of one data partition. We therefore propose a mechanism to combine the local predictions and obtain accurate final predictions. Finally, we experimentally demonstrate the effectiveness, efficiency, and scalability of our approach on real-world datasets in comparison to recent data cleaning approaches.
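
The ideas named in the abstract (likelihood-driven replacement, horizontal partitioning, combining local predictions, and a bound on the amount of change) can be illustrated with a toy sketch. This is not the SCARE algorithm: a simple frequency estimate of P(city | zip) stands in for the machine learning models, the round-robin partitioning, the averaging rule, the `budget` parameter, and the assumption that the zip attribute is reliable are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
import math


def fit_conditional(rows):
    """Estimate P(city | zip) by relative frequency within one partition."""
    counts = defaultdict(Counter)
    for zip_code, city in rows:
        counts[zip_code][city] += 1
    return {
        z: {c: n / sum(cs.values()) for c, n in cs.items()}
        for z, cs in counts.items()
    }


def combine(models, zip_code):
    """Average the per-partition distributions for one zip code."""
    total = Counter()
    for model in models:
        for city, p in model.get(zip_code, {}).items():
            total[city] += p / len(models)
    return total


def repair(rows, n_partitions=2, budget=1, eps=1e-6):
    """Apply at most `budget` updates, highest log-likelihood gain first."""
    # Horizontal partitioning: each partition holds a subset of the records.
    partitions = [rows[i::n_partitions] for i in range(n_partitions)]
    models = [fit_conditional(p) for p in partitions]

    candidates = []
    for idx, (zip_code, city) in enumerate(rows):
        dist = combine(models, zip_code)
        best, p_best = dist.most_common(1)[0]
        if best != city:
            # Likelihood benefit of replacing the current value with `best`.
            gain = math.log(p_best + eps) - math.log(dist.get(city, 0.0) + eps)
            candidates.append((gain, idx, best))

    # Bounded changes: keep only the `budget` most beneficial updates.
    candidates.sort(reverse=True)
    repaired = list(rows)
    for _gain, idx, best in candidates[:budget]:
        repaired[idx] = (rows[idx][0], best)
    return repaired


# Hypothetical toy data: (zip, city), with two suspicious city values.
rows = [
    ("47906", "West Lafayette"),
    ("47906", "West Lafayette"),
    ("47906", "West Lafayette"),
    ("47906", "W Lafayet"),
    ("47906", "West Lafayette"),
    ("10001", "New York"),
    ("10001", "New York"),
    ("10001", "NYC"),
    ("10001", "New York"),
    ("10001", "New York"),
]

print(repair(rows, n_partitions=2, budget=2))
```

With `budget=1` only the single update with the largest likelihood gain is applied, which shows how a bound on changes trades coverage for confidence in this simplified setting.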