Independence is good: dependency-based histogram synopses for high-dimensional data

Authors:
Amol Deshpande;Minos Garofalakis;Rajeev Rastogi
Affiliations:
Univ of California, Berkeley and Bell Laboratories;Bell Laboratories;Bell Laboratories
Venue:
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Year:
2001

Citing 9
Cited 61

Resource allocation problems: algorithmic approaches

Resource allocation problems: algorithmic approaches
Probabilistic reasoning in intelligent systems: networks of plausible inference

Probabilistic reasoning in intelligent systems: networks of plausible inference
Introduction to algorithms

Introduction to algorithms
Optimal histograms for limiting worst-case error propagation in the size of join results

ACM Transactions on Database Systems (TODS)
Improved histograms for selectivity estimation of range predicates

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Relaxing the uniformity and independence assumptions using the concept of fractal dimension

Journal of Computer and System Sciences - Special issue on principles of database systems
Approximating multi-dimensional aggregate range queries over real attributes

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Approximate Query Processing Using Wavelets

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Selectivity Estimation Without the Attribute Value Independence Assumption

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases

Hierarchical subspace sampling: a unified framework for high dimensional data reduction, selectivity estimation and nearest neighbor search

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Fast incremental maintenance of approximate histograms

ACM Transactions on Database Systems (TODS)
Approximate Query Processing: Taming the TeraBytes

Proceedings of the 27th International Conference on Very Large Data Bases
Network Data Mining and Analysis: The NEMESIS Project

PAKDD '02 Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Extended wavelets for multiple measures

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Beyond Independence: Probabilistic Models for Query Approximation on Binary Transaction Data

IEEE Transactions on Knowledge and Data Engineering
Screening and interpreting multi-item associations based on log-linear modeling

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
The power-method: a comprehensive estimation technique for multi-dimensional queries

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Analysis of predictive spatio-temporal queries

ACM Transactions on Database Systems (TODS)
A multi-dimensional histogram for selectivity estimation and fast approximate query answering

CASCON '03 Proceedings of the 2003 conference of the Centre for Advanced Studies on Collaborative research
Probabilistic wavelet synopses

ACM Transactions on Database Systems (TODS)
Selectivity Estimation for XML Twigs

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Conditional selectivity for statistics on query expressions

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
CORDS: automatic discovery of correlations and soft functional dependencies

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
An Efficient Subspace Sampling Framework for High-Dimensional Data Reduction, Selectivity Estimation, and Nearest-Neighbor Search

IEEE Transactions on Knowledge and Data Engineering
Maintaining Implicated Statistics in Constrained Environments

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Deterministic wavelet thresholding for maximum-error metrics

PODS '04 Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Towards a robust query optimizer: a principled and practical approach

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Consistently estimating the selectivity of conjuncts of predicates

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Content-based routing: different plans for different data

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Self-tuning cost modeling of user-defined functions in an object-relational DBMS

ACM Transactions on Database Systems (TODS)
Wavelet synopses for general error metrics

ACM Transactions on Database Systems (TODS) - Special Issue: SIGMOD/PODS 2004
Graph-based synopses for relational selectivity estimation

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
GORDIAN: efficient and scalable discovery of composite keys

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Adaptive density estimation

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
XSKETCH synopses for XML data graphs

ACM Transactions on Database Systems (TODS)
Consistent selectivity estimation via maximum entropy

The VLDB Journal — The International Journal on Very Large Data Bases
Compressed histograms with arbitrary bucket layouts for selectivity estimation

Information Sciences: an International Journal
Management of probabilistic data: foundations and challenges

Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Structure and value synopses for XML data graphs

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Selectivity estimation of range queries based on data density approximation via cosine series

Data & Knowledge Engineering
The history of histograms (abridged)

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Distributed top-N query processing with possibly uncooperative local systems

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
SASH: a self-adaptive histogram set for dynamically changing workloads

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
BHUNT: automatic discovery of Fuzzy algebraic constraints in relational data

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Model-driven data acquisition in sensor networks

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Detecting attribute dependencies from query feedback

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Probabilistic graphical models and their role in databases

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Adaptive query processing

Foundations and Trends in Databases
Histograms based on the minimum description length principle

The VLDB Journal — The International Journal on Very Large Data Bases
Hierarchical synopses with optimal error guarantees

ACM Transactions on Database Systems (TODS)
Compressed hierarchical binary histograms for summarizing multi-dimensional data

Knowledge and Information Systems
On space constrained set selection problems

Data & Knowledge Engineering
Smooth Interpolating Histograms with Error Guarantees

BNCOD '08 Proceedings of the 25th British national conference on Databases: Sharing Data, Information and Knowledge
Architecture of a Database System

Foundations and Trends in Databases
TuG synopses for approximate query answering

ACM Transactions on Database Systems (TODS)
Multiplicative synopses for relative-error metrics

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
A new look at generating multi-join continuous query plans: A qualified plan generation problem

Data & Knowledge Engineering
An efficient histogram method for outlier detection

DASFAA'07 Proceedings of the 12th international conference on Database systems for advanced applications
Deriving predicate statistics in datalog

Proceedings of the 12th international ACM SIGPLAN symposium on Principles and practice of declarative programming
Sharing-aware horizontal partitioning for exploiting correlations during query processing

Proceedings of the VLDB Endowment
A quad-tree based multiresolution approach for two-dimensional summary data

Information Systems
Efficient stepwise selection in decomposable models

UAI'01 Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence
Learning approximate MRFs from large transaction data

PKDD'06 Proceedings of the 10th European conference on Principle and Practice of Knowledge Discovery in Databases
HASE: a hybrid approach to selectivity estimation for conjunctive predicates

EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches

Foundations and Trends in Databases
Sensitivity of self-tuning histograms: query order affecting accuracy and robustness

SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management
Deriving predicate statistics for logic rules

RR'12 Proceedings of the 6th international conference on Web Reasoning and Rule Systems
Efficiently adapting graphical models for selectivity estimation

The VLDB Journal — The International Journal on Very Large Data Bases
Selectivity estimation for hybrid queries over text-rich data graphs

Proceedings of the 16th International Conference on Extending Database Technology
CS2: a new database synopsis for query estimation

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Quantified Score

Hi-index	0.00

Visualization

Abstract

Approximating the joint data distribution of a multi-dimensional data set through a compact and accurate histogram synopsis is a fundamental problem arising in numerous practical scenarios, including query optimization and approximate query answering. Existing solutions either rely on simplistic independence assumptions or try to directly approximate the full joint data distribution over the complete set of attributes. Unfortunately, both approaches are doomed to fail for high-dimensional data sets with complex correlation patterns between attributes. In this paper, we propose a novel approach to histogram-based synopses that employs the solid foundation of statistical interaction models to explicitly identify and exploit the statistical characteristics of the data. Abstractly, our key idea is to break the synopsis into (1) a statistical interaction model that accurately captures significant correlation and independence patterns in data, and (2) a collection of histograms on low-dimensional marginals that, based on the model, can provide accurate approximations of the overall joint data distribution. Extensive experimental results with several real-life data sets verify the effectiveness of our approach. An important aspect of our general, model-based methodology is that it can be used to enhance the performance of other synopsis techniques that are based on data-space partitioning (e.g., wavelets) by providing an effective tool to deal with the “dimensionality curse”.