A demonstration of DBWipes: clean as you query

Authors:
Eugene Wu;Samuel Madden;Michael Stonebraker
Affiliations:
MIT CSAIL;MIT CSAIL;MIT CSAIL
Venue:
Proceedings of the VLDB Endowment
Year:
2012

Abstract

As data analytics becomes mainstream and the underlying data and computation grow more complex, tools that help analysts understand why a query result contains errors will become increasingly important. Data provenance has been a large step toward debugging complex workflows, but in its current form it has limited utility for debugging aggregation operators that compute a single output from a large collection of inputs. Traditional provenance returns the entire input collection, which has very low precision. In contrast, users are seeking precise descriptions of the inputs that caused the errors. We propose a Ranked Provenance System, which identifies subsets of inputs that influenced the output error, describes each subset with human-readable predicates, and orders the subsets by their contribution to the error. In this demonstration, we will present DBWipes, a novel data cleaning system that allows users to execute aggregate queries and to interactively detect, understand, and clean errors in the query results. Conference attendees will explore anomalies in campaign donations from the current US presidential election and in readings from a 54-node sensor deployment.
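
The idea of ranking input subsets by their contribution to an aggregate error can be illustrated with a toy sketch. This is not the DBWipes algorithm, which the abstract does not specify; it is a brute-force illustration under stated assumptions. The sensor readings, the `expected` value supplied by the user, the restriction to single `attribute = value` predicates, and the function names `candidate_predicates` and `rank_predicates` are all hypothetical. Each predicate is scored by how much the error of an average shrinks when the rows matching it are deleted.

```python
from statistics import mean

# Hypothetical sensor readings (invented for illustration, not from the paper).
readings = [
    {"sensor": 1, "hour": 0, "temp": 20.0},
    {"sensor": 2, "hour": 0, "temp": 21.0},
    {"sensor": 3, "hour": 0, "temp": 120.0},
    {"sensor": 1, "hour": 1, "temp": 22.0},
    {"sensor": 2, "hour": 1, "temp": 23.0},
    {"sensor": 3, "hour": 1, "temp": 118.0},
]


def candidate_predicates(rows, attributes):
    """Yield (description, test) pairs for every `attribute = value` predicate."""
    for attr in attributes:
        for value in sorted({row[attr] for row in rows}):
            # Bind attr and value as defaults so each lambda keeps its own pair.
            yield f"{attr} = {value}", (lambda row, a=attr, v=value: row[a] == v)


def rank_predicates(rows, attributes, expected, measure="temp"):
    """Rank predicates by how much deleting their matching rows reduces
    the absolute error of AVG(measure) relative to `expected`."""
    base_error = abs(mean(r[measure] for r in rows) - expected)
    ranked = []
    for description, test in candidate_predicates(rows, attributes):
        kept = [r for r in rows if not test(r)]
        if not kept:
            continue  # deleting every row leaves no aggregate to evaluate
        error = abs(mean(r[measure] for r in kept) - expected)
        ranked.append((base_error - error, description))
    # Largest error reduction first.
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked


# Assumption: the analyst judges that the average should be about 21.5.
ranked = rank_predicates(readings, ["sensor", "hour"], expected=21.5)
for reduction, description in ranked:
    print(f"{description}: error reduced by {reduction:.2f}")
```

On this toy data the predicate `sensor = 3` ranks first, because removing its two readings brings the average back to the expected value. Traditional provenance, by contrast, would return all six rows. A real system would need to search conjunctions of predicates and avoid enumerating every candidate, which this sketch does not attempt.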