Effective use of block-level sampling in statistics estimation

  • Authors:
  • Surajit Chaudhuri;Gautam Das;Utkarsh Srivastava

  • Affiliations:
  • Microsoft Research;Microsoft Research;Stanford University

  • Venue:
  • SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

Block-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics. In this paper, we develop principled approaches to overcome this limitation of block-level sampling for histograms as well as distinct-value estimations. For histogram construction, we give a novel two-phase adaptive method in which the sample size required to reach a desired accuracy is decided based on a first phase sample. This method is significantly faster than previous iterative methods proposed for the same problem. For distinct-value estimation, we show that existing estimators designed for uniform-random samples may perform very poorly if used directly on block-level samples. We present a key technique that computes an appropriate subset of a block-level sample that is suitable for use with most existing estimators. This, to the best of our knowledge, is the first principled method for distinct-value estimation with block-level samples. We provide extensive experimental results validating our methods.