On the optimal nesting order for computing N-relational joins
ACM Transactions on Database Systems (TODS)
Equi-depth multidimensional histograms
SIGMOD '88 Proceedings of the 1988 ACM SIGMOD international conference on Management of data
Practical selectivity estimation through adaptive sampling
SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
On the propagation of errors in the size of join results
SIGMOD '91 Proceedings of the 1991 ACM SIGMOD international conference on Management of data
Sequential sampling procedures for query size estimation
SIGMOD '92 Proceedings of the 1992 ACM SIGMOD international conference on Management of data
Optimal histograms for limiting worst-case error propagation in the size of join results
ACM Transactions on Database Systems (TODS)
Balancing histogram optimality and practicality for query result size estimation
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Improved histograms for selectivity estimation of range predicates
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
The space complexity of approximating the frequency moments
STOC '96 Proceedings of the twenty-eighth annual ACM symposium on Theory of computing
Communication complexity
An overview of query optimization in relational systems
PODS '98 Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Random sampling for histogram construction: how much is enough?
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Tracking join and self-join sizes in limited storage
PODS '99 Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Approximate computation of multidimensional aggregates of sparse data using wavelets
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Join synopses for approximate query answering
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Implications of certain assumptions in database performance evauation
ACM Transactions on Database Systems (TODS)
Optimal histograms for hierarchical range queries (extended abstract)
PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Approximating multi-dimensional aggregate range queries over real attributes
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
STHoles: a multidimensional workload-aware histogram
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
On the complexity of approximate query optimization
Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Numerical Recipes in FORTRAN; The Art of Scientific Computing
Numerical Recipes in FORTRAN; The Art of Scientific Computing
Processing complex aggregate queries over data streams
Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Exploiting statistics on query expressions for optimization
Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Elementary Numerical Analysis: An Algorithmic Approach
Elementary Numerical Analysis: An Algorithmic Approach
Access path selection in a relational database management system
SIGMOD '79 Proceedings of the 1979 ACM SIGMOD international conference on Management of data
Accurate estimation of the number of tuples satisfying a condition
SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
Optimal Histograms with Quality Guarantees
VLDB '98 Proceedings of the 24rd International Conference on Very Large Data Bases
Simple Random Sampling from Relational Databases
VLDB '86 Proceedings of the 12th International Conference on Very Large Data Bases
Approximate Query Processing Using Wavelets
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
On B-Tree Indices for Skewed Distributions
VLDB '92 Proceedings of the 18th International Conference on Very Large Data Bases
Universality of Serial Histograms
VLDB '93 Proceedings of the 19th International Conference on Very Large Data Bases
Fast Incremental Maintenance of Approximate Histograms
VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Some complexity questions related to distributive computing(Preliminary Report)
STOC '79 Proceedings of the eleventh annual ACM symposium on Theory of computing
The complexity of massive data set computations
The complexity of massive data set computations
The history of histograms (abridged)
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Combinatorial bounds for list decoding
IEEE Transactions on Information Theory
The VC-dimension of SQL queries and selectivity estimation through sampling
ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part II
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches
Foundations and Trends in Databases
Hi-index | 0.00 |
Database systems use precomputed synopses of data to estimate the cost of alternative plans during query optimization. A number of alternative synopsis structures have been proposed, but histograms are by far the most commonly used. While histograms have proved to be very effective in (cost estimation for) single-table selections, queries with joins have long been seen as a challenge; under a model where histograms are maintained for individual tables, a celebrated result of Ioannidis and Christodoulakis [1991] observes that errors propagate exponentially with the number of joins in a query.In this article, we make two main contributions. First, we study the space complexity of using synopses for query optimization from a novel information-theoretic perspective. In particular, we offer evidence in support of histograms for single-table selections, including an analysis over data distributions known to be common in practice, and illustrate their limitations for join queries. Second, for a broad class of common queries involving joins (specifically, all queries involving only key-foreign key joins) we show that the strategy of storing a small precomputed sample of the database yields probabilistic guarantees that are almost space-optimal, which is an important property if these samples are to be used as database statistics. This is the first such optimality result, to our knowledge, and suggests that precomputed samples might be an effective way to circumvent the error propagation problem for queries with key-foreign key joins. We support this result empirically through an experimental study that demonstrates the effectiveness of precomputed samples, and also shows the increasing difference in the effectiveness of samples versus multidimensional histograms as the number of joins in the query grows.