Approximate lineage for probabilistic databases

Authors:
Christopher Ré;Dan Suciu
Affiliations:
University of Washington, Seattle;University of Washington, Seattle
Venue:
Proceedings of the VLDB Endowment
Year:
2008


Abstract

In probabilistic databases, lineage is fundamental to both query processing and understanding the data. Current systems such as Trio or Mystiq use a complete approach in which the lineage for a tuple t is a Boolean formula that represents all derivations of t. In large databases, lineage formulas can become huge: in one public database (the Gene Ontology) we often observed 10 MB of lineage (provenance) data for a single tuple. In this paper we propose to use approximate lineage, a much smaller formula that keeps track of only the most important derivations and that the system can use to process queries and provide explanations. We discuss in detail two specific kinds of approximate lineage: (1) a conservative approximation called sufficient lineage, which records the most important derivations for each tuple, and (2) polynomial lineage, which is more aggressive, can provide higher compression ratios, and is based on Fourier approximations of Boolean expressions. We define approximate lineage formally, describe algorithms to compute it, formally prove their error bounds, and validate our approach experimentally on a real data set.
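
To make the idea of sufficient lineage concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes that the lineage is a monotone DNF formula over independent tuple variables, and it uses a simple greedy rule (keep the k derivations with the highest individual probability) that is chosen here purely for illustration. The names `dnf_probability`, `monomial_probability`, `sufficient_lineage`, and the example data are all hypothetical. Because dropping derivations from a monotone DNF yields a formula that implies the original, the approximate probability can never exceed the exact one.

```python
from itertools import product


def dnf_probability(monomials, probs):
    """Exact probability that a monotone DNF is true.

    Enumerates every assignment of the variables mentioned in the formula,
    so it is exponential and only suitable for tiny examples.
    Assumes the tuple variables are independent (an illustrative assumption).
    """
    variables = sorted({v for m in monomials for v in m})
    total = 0.0
    for bits in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, bits))
        # The DNF holds if at least one derivation has all its tuples present.
        if any(all(world[v] for v in m) for m in monomials):
            weight = 1.0
            for v in variables:
                weight *= probs[v] if world[v] else 1.0 - probs[v]
            total += weight
    return total


def monomial_probability(monomial, probs):
    """Probability that every tuple in a single derivation is present."""
    p = 1.0
    for v in monomial:
        p *= probs[v]
    return p


def sufficient_lineage(monomials, probs, k):
    """Keep the k most probable derivations (hypothetical greedy heuristic)."""
    ranked = sorted(
        monomials,
        key=lambda m: monomial_probability(m, probs),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical lineage of one output tuple: three derivations over five tuples.
lineage = [("x1", "y1"), ("x1", "y2"), ("x2", "y3")]
probs = {"x1": 0.9, "y1": 0.8, "y2": 0.5, "x2": 0.1, "y3": 0.2}

approx = sufficient_lineage(lineage, probs, k=2)
exact_p = dnf_probability(lineage, probs)
approx_p = dnf_probability(approx, probs)
print(approx, exact_p, approx_p)
```

In this toy instance the discarded derivation has low probability, so the smaller formula loses very little; how to choose the derivations and bound the error in general is what the paper addresses.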