Rangesum histograms

  • Authors:
  • S. Muthukrishnan;Martin Strauss

  • Affiliations:
  • AT&T Labs---Research, Florham Park, NJ;AT&T Labs---Research, Florham Park, NJ

  • Venue:
  • SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
  • Year:
  • 2003

Quantified Score

Hi-index 0.00

Visualization

Abstract

A rangesum query to an array A is a pair (l, r) of range endpoints, which should be answered by Σl≤irA[i]. To compress A, we consider representing an array A lossily by a histogram, a function that is constant on each of a small number of buckets. We then answer range queries from H instead of from A, i.e., as Σl≤irH[i]. An optimal rangesum histogram H for this purpose is one whose bucket boundaries and constant heights within buckets are chosen to minimize the expected square error, El, r[(Σl≤irA[i]--Σl≤irH[i].)2], assuming each rangesum query is equally likely. Rangesum histograms find many applications in database systems.In a degenerate variation, all rangesum queries are over ranges of size one, namely, individual points; histograms optimal for this special case are called pointwise optimal histograms. Pointwise optimal histogram is a classical notion in statistics and approximation theory, but rangesum optimal histogram appears to be novel in these areas. While optimal pointwise histograms can be constructed efficiently by simple dynamic progrmming, no efficient (even approximate) general rangesmn histogram construction algorithms were previously known. In practice, all commercial database systems use heuristically built histograms for pointwise and rangesum queries.We present the first general algorithms for approximate rangesum histograms. Given parameter B, we denote by (α, β)-approximation an algorithm to produce a (αB)-bucket histogram with error at most β times the error of the optimal B-bucket histogram. We give a (2, 1)-approximation with runtime O(N2B), a (2, 1+∊)-approximation with runtime N + (B log(N)/∊)O(1) (1), and a (1, 1 + ∊)-approximation with runtime O(B3N4/∊2). We also consider the problem of dynamic maintenance of rangesum histograms for data updated by additive changes, and we give a (2, 1 + ∊)-approximation that uses space (Blog(N)/∊)O(1) and time (Blog(N)/∊)O(1) for update and query operations. The bounds are nearly competitive with some of the best known bounds for constructing pointwise optimal histograms modulo small additional number of buckets used; however, rangesum histograms are substantially harder to construct because of the long range dependence between subproblems.