Histograms revisited: when are histograms the best approximation method for aggregates over joins?

  • Authors: Alin Dobra
  • Affiliations: University of Florida, Gainesville, FL
  • Venue: Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
  • Year: 2005

Abstract

The traditional statistical assumption for interpreting histograms and justifying approximate query processing methods based on them is that all elements in a bucket have the same frequency, the so-called uniform distribution assumption. In this paper we show that a significantly less restrictive statistical assumption, namely that the elements within a bucket are randomly arranged even though they might have different frequencies, leads to identical formulae for approximating aggregate queries using histograms. This observation allows us to identify scenarios in which histograms are well suited as approximation methods (in fact, we show that in these situations sampling and sketching are significantly worse) and to provide tight error guarantees for the quality of approximations. At the same time we show that, on average, histograms are rather poor approximators outside these scenarios.
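
The equivalence of the two assumptions can be illustrated for a single bucket. The sketch below is not taken from the paper: it assumes the standard per-bucket histogram estimate of a join size (product of the two bucket totals divided by the number of domain values in the bucket), and the function names and frequency vectors are hypothetical. It checks by brute force that averaging the exact join size over every arrangement of one relation's frequencies within the bucket gives the same number as that estimate.

```python
import itertools
from fractions import Fraction


def bucket_join_estimate(f, g):
    """Histogram estimate of sum_i f[i] * g[i] for one bucket.

    Only the bucket totals and the bucket width are used, which is all
    a histogram stores. (Hypothetical helper, for illustration only.)
    """
    assert len(f) == len(g)
    return Fraction(sum(f) * sum(g), len(f))


def expected_join_under_random_arrangement(f, g):
    """Exact join size averaged over all arrangements of g in the bucket."""
    perms = list(itertools.permutations(g))
    total = sum(sum(a * b for a, b in zip(f, p)) for p in perms)
    return Fraction(total, len(perms))


# Made-up, clearly non-uniform frequencies for a bucket of 4 domain values.
f = [5, 1, 0, 2]
g = [3, 0, 4, 1]

estimate = bucket_join_estimate(f, g)
expected = expected_join_under_random_arrangement(f, g)
exact = sum(a * b for a, b in zip(f, g))

print("histogram estimate:", estimate)
print("mean over arrangements:", expected)
print("exact join size for this arrangement:", exact)
```

Here the estimate and the mean over arrangements coincide even though neither frequency vector is uniform, while the exact join size for the particular arrangement shown differs from both. The brute-force enumeration is only practical for tiny buckets; it is used here as a check, not as an estimation method.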