Computational geometry: an introduction
Computational geometry: an introduction
The space complexity of approximating the frequency moments
STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
Optimal Histograms with Quality Guarantees
VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Selectivity Estimation Without the Attribute Value Independence Assumption
VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Mining Deviants in a Time Series Database
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Scaling and related techniques for geometry problems
STOC '84 Proceedings of the sixteenth annual ACM symposium on Theory of computing
A divisive information theoretic feature clustering algorithm for text classification
The Journal of Machine Learning Research
MYSTIQ: a system for finding more answers by using probabilities
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Approximation and streaming algorithms for histogram construction problems
ACM Transactions on Database Systems (TODS)
Sketching probabilistic data streams
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Management of probabilistic data: foundations and challenges
Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Estimating statistical aggregates on probabilistic data streams
Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Efficient aggregation algorithms for probabilistic data
SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
The history of histograms (abridged)
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
REHIST: relative error histogram construction algorithms
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Efficient query evaluation on probabilistic databases
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
MCDB: a monte carlo approach to managing uncertain data
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Managing and Mining Uncertain Data
Managing and Mining Uncertain Data
Fast and Simple Relational Processing of Uncertain Data
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Database Support for Probabilistic Attributes and Tuples
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Histograms and Wavelets on Probabilistic Data
ICDE '09 Proceedings of the 2009 IEEE International Conference on Data Engineering
(Approximate) uncertain skylines
Proceedings of the 14th International Conference on Database Theory
Synopses for probabilistic data over large domains
Proceedings of the 14th International Conference on Extending Database Technology
DuoWave: Mitigating the curse of dimensionality for uncertain data
Data & Knowledge Engineering
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches
Foundations and Trends in Databases
Histograms as statistical estimators for aggregate queries
Information Systems
Efficient and scalable monitoring and summarization of large probabilistic data
Proceedings of the 2013 Sigmod/PODS Ph.D. symposium on PhD symposium
Data & Knowledge Engineering
Hi-index | 0.00 |
There is a growing realization that modern database management systems (DBMSs) must be able to manage data that contains uncertainties that are represented in the form of probabilistic relations. Consequently, the design of each core DBMS component must be revisited in the presence of uncertain and probabilistic information. In this paper, we study how to build histogram synopses for probabilistic relations, for the purposes of enabling both DBMS-internal decisions (such as indexing and query planning), and (possibly, user-facing) approximate query processing tools. In contrast to initial work in this area, our probabilistic histograms retain the key possible-worlds semantics of probabilistic data, allowing for more accurate, yet concise, representation of the uncertainty characteristics of data and query results. We present a variety of techniques for building optimal probabilistic histograms, each one tuned to a different choice of approximation-error metric. We show that these can be incorporated into a general Dynamic Programming (DP) framework, which generalizes that used for existing histogram constructions. The end result is a histogram where each "bucket" is approximately represented by a compact probability distribution function (PDF), which can be used as the basis for query planning and approximate query answering. We present novel, polynomial-time algorithms to find optimal probabilistic histograms for a variety of PDF-error metrics (including variation distance, sum squared error, max error and EMD1). Our experimental study shows that our probabilistic histogram synopses can accurately capture the key statistical properties of uncertain data, while being much more compact to store and work with than the original uncertain relations.