There is a growing realization that uncertain information is a first-class citizen in modern database management. As such, we need techniques to correctly and efficiently process uncertain data in database systems. In particular, data reduction techniques that can produce concise, accurate synopses of large probabilistic relations are crucial. Similar to their deterministic-relation counterparts, such compact probabilistic data synopses can form the foundation for human understanding and interactive data exploration, probabilistic query planning and optimization, and fast approximate query processing in probabilistic database systems. In this paper, we introduce definitions and algorithms for building histogram- and Haar-wavelet-based synopses on probabilistic data. The core problem is to choose a set of histogram bucket boundaries or wavelet coefficients to optimize the accuracy of the approximate representation of a collection of probabilistic tuples under a given error metric. For a variety of different error metrics, we devise efficient algorithms that construct optimal or near-optimal size-B histogram and wavelet synopses. This requires careful analysis of the structure of the probability distributions, and novel extensions of known dynamic-programming-based techniques for the deterministic domain. Our experiments show that this approach clearly outperforms simple ideas, such as building summaries for samples drawn from the data distribution, while taking equal or less time.
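To make the bucket-selection problem concrete, the sketch below shows the classical deterministic dynamic program for a V-optimal (sum-of-squared-errors) size-B histogram, the style of DP the abstract says the paper extends to probabilistic data. This is only an illustrative deterministic skeleton under assumed SSE error, not the paper's probabilistic algorithm; the function name and interface are hypothetical.

```python
# Sketch: V-optimal histogram by dynamic programming (deterministic case).
# The paper's contribution extends DPs of this flavor to probabilistic
# tuples; here each bucket is summarized by its mean frequency and the
# error metric is assumed to be sum of squared errors (SSE).

def v_optimal_histogram(freqs, num_buckets):
    """Return (minimum SSE, sorted bucket start indices) for a size-B histogram."""
    n = len(freqs)
    # Prefix sums let us compute any bucket's SSE in O(1).
    pre = [0.0] * (n + 1)
    pre2 = [0.0] * (n + 1)
    for i, f in enumerate(freqs):
        pre[i + 1] = pre[i] + f
        pre2[i + 1] = pre2[i] + f * f

    def sse(i, j):
        # SSE of approximating freqs[i..j] by its mean:
        # sum(f^2) - (sum f)^2 / length.
        s = pre[j + 1] - pre[i]
        s2 = pre2[j + 1] - pre2[i]
        return s2 - s * s / (j - i + 1)

    INF = float("inf")
    # dp[k][j] = minimum error covering freqs[0..j] with k buckets.
    dp = [[INF] * n for _ in range(num_buckets + 1)]
    cut = [[-1] * n for _ in range(num_buckets + 1)]
    for j in range(n):
        dp[1][j] = sse(0, j)
    for k in range(2, num_buckets + 1):
        for j in range(n):
            # Try every start index i of the last bucket.
            for i in range(k - 1, j + 1):
                cand = dp[k - 1][i - 1] + sse(i, j)
                if cand < dp[k][j]:
                    dp[k][j] = cand
                    cut[k][j] = i
    # Walk the cut table backwards to recover bucket start indices.
    bounds, k, j = [], num_buckets, n - 1
    while k > 1:
        i = cut[k][j]
        bounds.append(i)
        j, k = i - 1, k - 1
    bounds.append(0)
    return dp[num_buckets][n - 1], sorted(bounds)
```

For example, on the frequency vector `[1, 1, 1, 10, 10, 10]` with B = 2, the DP places one bucket over each constant run, achieving zero error. The probabilistic setting in the paper replaces the per-bucket SSE term with an expected-error term over the tuples' probability distributions, which is what requires the structural analysis the abstract mentions.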