Approximation and streaming algorithms for histogram construction problems

  • Authors:
  • Sudipto Guha;Nick Koudas;Kyuseok Shim

  • Affiliations:
  • University of Pennsylvania, Philadelphia, PA;University of Toronto, Ont, Canada;Seoul National University, Seoul, Korea

  • Venue:
  • ACM Transactions on Database Systems (TODS)
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

Histograms and related synopsis structures are popular techniques for approximating data distributions. These have been successful in query optimization and a variety of applications, including approximate querying, similarity searching, and data mining, to name a few. Histograms were a few of the earliest synopsis structures proposed and continue to be used widely. The histogram construction problem is to construct the best histogram restricted to a space bound that reflects the data distribution most accurately under a given error measure.The histograms are used as quick and easy estimates. Thus, a slight loss of accuracy, compared to the optimal histogram under the given error measure, can be offset by fast histogram construction algorithms. A natural question arises in this context: Can we find a fast near optimal approximation algorithm for the histogram construction problem? In this article, we give the first linear time (1+ε)-factor approximation algorithms (for any ε 0) for a large number of histogram construction problems including the use of piecewise small degree polynomials to approximate data, workloads, etc. Several of our algorithms extend to data streams.Using synthetic and real-life data sets, we demonstrate that in many scenarios the approximate histograms are almost identical to optimal histograms in quality and are significantly faster to construct.