Rangesum histograms

Authors:
S. Muthukrishnan; Martin Strauss
Affiliations:
AT&T Labs-Research, Florham Park, NJ; AT&T Labs-Research, Florham Park, NJ
Venue:
SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Year:
2003


Abstract

A rangesum query to an array $A$ is a pair $(l, r)$ of range endpoints, which should be answered by $\sum_{l \le i \le r} A[i]$. To compress $A$, we consider representing it lossily by a histogram $H$, a function that is constant on each of a small number of buckets. We then answer range queries from $H$ instead of from $A$, i.e., as $\sum_{l \le i \le r} H[i]$. An optimal rangesum histogram $H$ for this purpose is one whose bucket boundaries and constant heights within buckets are chosen to minimize the expected square error, $E_{l,r}\big[\big(\sum_{l \le i \le r} A[i] - \sum_{l \le i \le r} H[i]\big)^2\big]$, assuming each rangesum query is equally likely. Rangesum histograms find many applications in database systems.

In a degenerate variation, all rangesum queries are over ranges of size one, namely, individual points; histograms optimal for this special case are called pointwise optimal histograms. The pointwise optimal histogram is a classical notion in statistics and approximation theory, but the rangesum optimal histogram appears to be novel in these areas. While optimal pointwise histograms can be constructed efficiently by simple dynamic programming, no efficient (even approximate) general rangesum histogram construction algorithms were previously known. In practice, all commercial database systems use heuristically built histograms for pointwise and rangesum queries.

We present the first general algorithms for approximate rangesum histograms. Given a parameter $B$, we denote by $(\alpha, \beta)$-approximation an algorithm that produces an $(\alpha B)$-bucket histogram with error at most $\beta$ times the error of the optimal $B$-bucket histogram. We give a $(2, 1)$-approximation with runtime $O(N^2 B)$, a $(2, 1+\epsilon)$-approximation with runtime $N + (B \log(N)/\epsilon)^{O(1)}$, and a $(1, 1+\epsilon)$-approximation with runtime $O(B^3 N^4/\epsilon^2)$. We also consider the problem of dynamic maintenance of rangesum histograms for data updated by additive changes, and we give a $(2, 1+\epsilon)$-approximation that uses space $(B \log(N)/\epsilon)^{O(1)}$ and time $(B \log(N)/\epsilon)^{O(1)}$ for update and query operations.

The bounds are nearly competitive with some of the best known bounds for constructing pointwise optimal histograms, modulo the small additional number of buckets used; however, rangesum histograms are substantially harder to construct because of the long-range dependence between subproblems.
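
To make the two error notions in the abstract concrete, the following Python sketch computes the expected square rangesum error of a given histogram and builds a pointwise optimal histogram with the classical dynamic program. It is an illustration under stated assumptions, not the paper's approximation algorithms: the function names `rangesum_error` and `pointwise_optimal_histogram` are hypothetical, the expectation is assumed to be uniform over all $N(N+1)/2$ ranges with $l \le r$, and the pointwise objective is taken to be the sum of squared pointwise errors.

```python
from itertools import accumulate


def rangesum_error(A, H):
    """Mean squared error over all ranges (l, r), l <= r, each assumed equally likely."""
    n = len(A)
    # D[k] = sum of (A[i] - H[i]) for i < k, so the error on [l, r] is D[r + 1] - D[l].
    D = [0.0] + list(accumulate(a - h for a, h in zip(A, H)))
    total = 0.0
    for l in range(n):
        for r in range(l, n):
            total += (D[r + 1] - D[l]) ** 2
    return total / (n * (n + 1) / 2)


def pointwise_optimal_histogram(A, B):
    """Classical O(N^2 B) dynamic program for the pointwise case (requires B <= len(A))."""
    n = len(A)
    S = [0.0] + list(accumulate(A))                  # prefix sums
    Q = [0.0] + list(accumulate(a * a for a in A))   # prefix sums of squares

    def sse(i, j):
        # Squared error of approximating A[i:j] by its mean, the best pointwise height.
        s = S[j] - S[i]
        return (Q[j] - Q[i]) - s * s / (j - i)

    INF = float("inf")
    # best[b][j] = minimum pointwise error covering A[:j] with exactly b buckets.
    best = [[INF] * (n + 1) for _ in range(B + 1)]
    cut = [[0] * (n + 1) for _ in range(B + 1)]
    best[0][0] = 0.0
    for b in range(1, B + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                cand = best[b - 1][i] + sse(i, j)
                if cand < best[b][j]:
                    best[b][j], cut[b][j] = cand, i

    # Walk the stored cut points backwards and fill each bucket with its mean.
    H = [0.0] * n
    j = n
    for b in range(B, 0, -1):
        i = cut[b][j]
        H[i:j] = [(S[j] - S[i]) / (j - i)] * (j - i)
        j = i
    return H
```

Calling `rangesum_error(A, pointwise_optimal_histogram(A, B))` scores a pointwise optimal histogram under the rangesum objective. That histogram is not guaranteed to minimize the rangesum error: in the pointwise program each bucket's cost depends only on the bucket itself, whereas a range's error accumulates across every bucket it spans, which is the long-range dependence the abstract mentions.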