Probabilistic histograms for probabilistic data

Authors:
Graham Cormode; Antonios Deligiannakis; Minos Garofalakis; Andrew McGregor
Affiliations:
AT&T Labs--Research; Technical University of Crete; Technical University of Crete; University of Massachusetts, Amherst
Venue:
Proceedings of the VLDB Endowment
Year:
2009

Abstract

There is a growing realization that modern database management systems (DBMSs) must be able to manage uncertain data represented in the form of probabilistic relations. Consequently, the design of each core DBMS component must be revisited in the presence of uncertain and probabilistic information. In this paper, we study how to build histogram synopses for probabilistic relations, to support both DBMS-internal decisions (such as indexing and query planning) and (possibly user-facing) approximate query processing tools. In contrast to initial work in this area, our probabilistic histograms retain the key possible-worlds semantics of probabilistic data, allowing for a more accurate, yet concise, representation of the uncertainty characteristics of data and query results. We present a variety of techniques for building optimal probabilistic histograms, each one tuned to a different choice of approximation-error metric. We show that these can be incorporated into a general Dynamic Programming (DP) framework, which generalizes that used for existing histogram constructions. The end result is a histogram where each "bucket" is approximately represented by a compact probability distribution function (PDF), which can be used as the basis for query planning and approximate query answering. We present novel, polynomial-time algorithms to find optimal probabilistic histograms for a variety of PDF-error metrics (including variation distance, sum squared error, max error and EMD1). Our experimental study shows that our probabilistic histogram synopses can accurately capture the key statistical properties of uncertain data, while being much more compact to store and work with than the original uncertain relations.
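
To make the DP framework mentioned in the abstract concrete, the following is a minimal sketch of a bucket-boundary dynamic program over per-item PDFs. It is a simplified illustration under stated assumptions, not the paper's algorithm: each item's uncertainty is given as a list of probabilities over a small set of frequency values, the error metric is sum squared error, and each bucket is represented by the unrestricted pointwise mean of its items' PDFs (the paper's representatives are compact PDFs, which this sketch does not enforce). The names `bucket_cost` and `build_histogram` are hypothetical.

```python
from typing import List, Tuple

Pdf = List[float]  # pdf[v] = Pr[item takes the v-th frequency value]


def bucket_cost(pdfs: List[Pdf], start: int, end: int) -> float:
    """Sum squared error between each PDF in pdfs[start:end] and their
    pointwise mean (assumed here as the bucket representative)."""
    n = end - start
    cost = 0.0
    for v in range(len(pdfs[0])):
        mean = sum(pdfs[i][v] for i in range(start, end)) / n
        cost += sum((pdfs[i][v] - mean) ** 2 for i in range(start, end))
    return cost


def build_histogram(pdfs: List[Pdf], num_buckets: int) -> Tuple[float, List[Tuple[int, int]]]:
    """Return (total error, bucket ranges) minimizing the summed bucket cost.
    Assumes 1 <= num_buckets <= len(pdfs). Ranges are half-open [start, end)."""
    n = len(pdfs)
    inf = float("inf")
    # opt[b][j]: minimum error covering the first j items with exactly b buckets
    opt = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    back = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    opt[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            # The last bucket covers items i..j-1; try every split point i.
            for i in range(b - 1, j):
                cand = opt[b - 1][i] + bucket_cost(pdfs, i, j)
                if cand < opt[b][j]:
                    opt[b][j] = cand
                    back[b][j] = i
    # Walk the back-pointers to recover the bucket boundaries.
    bounds = []
    j = n
    for b in range(num_buckets, 0, -1):
        i = back[b][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return opt[num_buckets][n], bounds


# Four items: the first two are certain to take value 0, the last two value 1.
example = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
error, buckets = build_histogram(example, 2)
```

The bucket cost is recomputed from scratch for every candidate range to keep the sketch short; a practical implementation would likely use prefix sums so that each cost lookup takes time independent of the bucket length. Swapping `bucket_cost` for a different per-bucket error function changes the metric without touching the DP loop.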