Statistical distortion: consequences of data cleaning

Authors: Tamraparni Dasu; Ji Meng Loh
Affiliations: AT&T Labs Research, NJ; AT&T Labs Research, NJ
Venue: Proceedings of the VLDB Endowment
Year: 2012

Abstract

We introduce the notion of statistical distortion as an essential metric for measuring the effectiveness of data cleaning strategies. We use this metric to propose a widely applicable yet scalable experimental framework for evaluating data cleaning strategies along three dimensions: glitch improvement, statistical distortion, and cost-related criteria. Existing metrics focus on glitch improvement and cost, but not on the statistical impact of data cleaning strategies. We illustrate our framework on real-world data, with a comprehensive suite of experiments and analyses.
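
As a rough illustration of how glitch improvement and statistical distortion can pull in different directions, the sketch below scores one cleaning step on both. The abstract does not say how distortion is computed, so this is an assumption: distortion is taken to be the one-dimensional earth mover's distance between the empirical distributions before and after cleaning. The function names (`emd_1d`, `glitch_rate`, `winsorize`), the outlier threshold, and the toy data are all hypothetical and not taken from the paper.

```python
def emd_1d(a, b):
    """1-D earth mover's distance between two equal-size samples.

    Assumption of this sketch: for equal-size samples the distance is the
    mean absolute difference between the sorted values.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-size samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def glitch_rate(values, upper):
    """Fraction of values flagged as glitches (here: above a threshold)."""
    return sum(v > upper for v in values) / len(values)


def winsorize(values, upper):
    """Hypothetical cleaning strategy: clip values above the threshold."""
    return [min(v, upper) for v in values]


raw = [1.0, 2.0, 3.0, 4.0, 100.0]      # toy data with one outlier
cleaned = winsorize(raw, upper=10.0)

# Dimension 1: how many glitches the strategy removed.
improvement = glitch_rate(raw, 10.0) - glitch_rate(cleaned, 10.0)

# Dimension 2: how far the cleaned distribution moved from the original.
distortion = emd_1d(raw, cleaned)

print(f"glitch improvement: {improvement:.2f}")
print(f"statistical distortion (EMD): {distortion:.2f}")
```

In this toy case the strategy removes every flagged glitch yet shifts the distribution noticeably, which is the kind of trade-off the framework is meant to expose. The third dimension, cost, is omitted here.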