ERACER: a database approach for statistical inference and data cleaning

Authors:
Chris Mayfield; Jennifer Neville; Sunil Prabhakar
Affiliations:
Purdue University, West Lafayette, IN, USA; Purdue University, West Lafayette, IN, USA; Purdue University, West Lafayette, IN, USA
Venue:
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Year:
2010

Abstract

Real-world databases often contain syntactic and semantic errors, in spite of integrity constraints and other safety measures incorporated into modern DBMSs. We present ERACER, an iterative statistical framework for inferring missing information and correcting such errors automatically. Our approach is based on belief propagation and relational dependency networks, and includes an efficient approximate inference algorithm that is easily implemented in standard DBMSs using SQL and user-defined functions. The system performs the inference and cleansing tasks in an integrated manner, using shrinkage techniques to infer correct values accurately even in the presence of dirty data. We evaluate the proposed methods empirically on multiple synthetic and real-world data sets. The results show that our framework achieves accuracy comparable to a baseline statistical method using Bayesian networks with exact inference. However, our framework has wider applicability than the Bayesian network baseline, due to its ability to reason with complex, cyclic relational dependencies.
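
The abstract names shrinkage as the technique that keeps inferred values accurate despite dirty data, but does not spell out the estimator. As a rough illustration only, the sketch below assumes one common, generic form of shrinkage: a group's sample mean is pulled toward a global mean, more strongly when the group has few observations. This is not taken from the paper and should not be read as ERACER's algorithm; the function name `shrunk_mean` and the `strength` pseudo-count are hypothetical.

```python
from statistics import mean


def shrunk_mean(group_values, global_mean, strength=5.0):
    """Blend a group's sample mean with a global mean (generic shrinkage).

    `strength` is a hypothetical pseudo-count: it says how many observations
    a group needs before its own mean is trusted as much as the global one.
    """
    n = len(group_values)
    if n == 0:
        # No local evidence at all: fall back entirely on the global mean.
        return global_mean
    # The weight on the local mean grows from 0 toward 1 as n increases.
    weight = n / (n + strength)
    return weight * mean(group_values) + (1.0 - weight) * global_mean


if __name__ == "__main__":
    # A small group containing one suspicious reading is pulled toward the
    # global mean; a missing group simply receives the global mean.
    print(shrunk_mean([21.0, 22.0, 95.0], global_mean=20.0))
    print(shrunk_mean([], global_mean=20.0))
```

Under this assumption, a single extreme value in a small group has limited influence on the estimate, which is the general intuition behind using shrinkage when the observed data may itself be erroneous.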