Scorpion: explaining away outliers in aggregate queries

Authors:
Eugene Wu; Samuel Madden
Venue:
Proceedings of the VLDB Endowment
Year:
2013

Abstract

Database users commonly explore large data sets by running aggregate queries that project the data down to a smaller number of points and dimensions, and visualizing the results. Often, such visualizations will reveal outliers that correspond to errors or surprising features of the input data set. Unfortunately, databases and visualization systems do not provide a way to work backwards from an outlier point to the common properties of the (possibly many) unaggregated input tuples that correspond to that outlier. We propose Scorpion, a system that takes a set of user-specified outlier points in an aggregate query result as input and finds predicates that explain the outliers in terms of properties of the input tuples that are used to compute the selected outlier results. Specifically, this explanation identifies predicates that, when applied to the input data, cause the outliers to disappear from the output. To find such predicates, we develop a notion of influence of a predicate on a given output, and design several algorithms that efficiently search for maximum-influence predicates over the input data. We show that these algorithms can quickly find outliers in two real data sets (from a sensor deployment and a campaign finance data set), and run orders of magnitude faster than a naive search algorithm while providing comparable quality on a synthetic data set.
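
The idea of a predicate's influence on an aggregate output can be illustrated with a small sketch. This is not Scorpion's actual influence definition or any of its search algorithms. It is a simplified stand-in that assumes influence is just the drop in an outlier group's average once the tuples matching a predicate are removed, and it enumerates single-attribute equality predicates exhaustively, much like the naive baseline the abstract mentions. The data, the names `avg_by_group`, `influence`, and `naive_search`, and the scoring rule are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sensor readings; hour 2 has a suspiciously high average.
ROWS = [
    {"hour": 1, "sensor": "a", "temp": 20.0},
    {"hour": 1, "sensor": "b", "temp": 21.0},
    {"hour": 1, "sensor": "c", "temp": 19.0},
    {"hour": 2, "sensor": "a", "temp": 20.0},
    {"hour": 2, "sensor": "b", "temp": 22.0},
    {"hour": 2, "sensor": "c", "temp": 90.0},
]


def avg_by_group(rows, group_key, value_key):
    """Equivalent of: SELECT group_key, AVG(value_key) ... GROUP BY group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {g: mean(vals) for g, vals in groups.items()}


def influence(rows, predicate, outlier_group, group_key, value_key):
    """Assumed scoring rule: how much the outlier group's average drops
    when every row matching `predicate` is deleted from the input."""
    before = avg_by_group(rows, group_key, value_key)[outlier_group]
    kept = [r for r in rows if not predicate(r)]
    after = avg_by_group(kept, group_key, value_key).get(outlier_group)
    if after is None:
        # The predicate deletes the whole group, which explains nothing.
        return 0.0
    return before - after


def naive_search(rows, attrs, outlier_group, group_key, value_key):
    """Exhaustively score every `attr == value` predicate and return the best."""
    candidates = {(a, r[a]) for r in rows for a in attrs}
    scored = []
    for attr, val in candidates:
        pred = lambda r, attr=attr, val=val: r[attr] == val
        score = influence(rows, pred, outlier_group, group_key, value_key)
        scored.append((score, attr, val))
    return max(scored, key=lambda t: t[0])


best = naive_search(ROWS, ["sensor"], outlier_group=2,
                    group_key="hour", value_key="temp")
print(best)
```

On this toy input the highest-scoring predicate is `sensor == "c"`: removing that sensor's readings brings the hour-2 average from 44.0 down to 21.0, in line with hour 1. Exhaustive enumeration like this grows quickly with the number of attributes and values, which is the cost the paper's more efficient algorithms are meant to avoid.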