We introduce the notion of statistical distortion as an essential metric for measuring the effectiveness of data cleaning strategies. Using this metric, we propose a widely applicable and scalable experimental framework for evaluating data cleaning strategies along three dimensions: glitch improvement, statistical distortion, and cost-related criteria. Existing metrics focus on glitch improvement and cost, but not on the statistical impact of data cleaning strategies. We illustrate our framework on real-world data with a comprehensive suite of experiments and analyses.
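One way to make the idea concrete: statistical distortion can be viewed as a distance between the distribution of the data before cleaning and the distribution after cleaning. The sketch below is an illustrative assumption, not the paper's exact definition; it uses the one-dimensional earth mover (Wasserstein) distance, which for two equal-size samples reduces to the mean absolute difference between the sorted samples.

```python
def statistical_distortion(original, cleaned):
    """Illustrative distortion measure: 1-D earth mover distance between
    the empirical distributions of two equal-size samples, which equals
    the mean absolute difference between the sorted samples.
    (Hypothetical helper; not the paper's formal definition.)"""
    a, b = sorted(original), sorted(cleaned)
    if len(a) != len(b):
        raise ValueError("samples must have equal size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# A cleaning strategy that replaces a glitchy outlier with an imputed value
# reduces the glitch count but shifts the value distribution:
dirty   = [1.0, 2.0, 2.5, 3.0, 100.0]  # one glitchy outlier
cleaned = [1.0, 2.0, 2.5, 3.0, 3.5]    # outlier replaced by an imputed value

print(statistical_distortion(dirty, cleaned))  # prints 19.3
```

A strategy that removes every glitch but yields a large distortion may bias downstream statistical analyses; the framework weighs this against glitch improvement and cost.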