Confidence bounds for sampling-based group by estimates

  • Authors: Fei Xu, Christopher Jermaine, Alin Dobra

  • Affiliations: University of Florida, Gainesville, FL (all authors)

  • Venue: ACM Transactions on Database Systems (TODS)
  • Year: 2008

Abstract

Sampling is now a very important data management tool, to such an extent that an interface for database sampling is included in the latest SQL standard. In this article we reconsider in depth what at first may seem like a very simple problem: computing the error of a sampling-based guess for the answer to a GROUP BY query over a multitable join. The difficulty when sampling for the answer to such a query is that the same sample is used to guess the result of the query for each group, which induces correlations among the estimates. Thus, from a statistical point of view it is very problematic and even dangerous to use traditional methods, such as one confidence interval per group, for communicating estimate accuracy to the user. We explore ways to address this problem, and pay particular attention to the computational aspects of computing “safe” confidence intervals.
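
The concern about per-group intervals can be illustrated with a small simulation. The sketch below is not the method of the article: it assumes a single table rather than a multitable join, a simple random sample without replacement, and a normal approximation for each SUM estimate, and it uses a generic Bonferroni correction as a stand-in for "safe" intervals. All names (`group_sum_intervals`, `simulate`, the population parameters) are hypothetical. It measures how often every group's interval contains the true answer at the same time.

```python
import random
from statistics import NormalDist


def group_sum_intervals(sample, pop_size, groups, z):
    """Return {group: (low, high)} intervals for SUM(value) per group.

    Each group's estimate treats y_i = value if the row is in the group,
    else 0, and scales the sample mean of y_i up to the population.
    """
    n = len(sample)
    fpc = 1.0 - n / pop_size  # finite-population correction
    s1 = {g: 0.0 for g in groups}  # per-group sum of y_i
    s2 = {g: 0.0 for g in groups}  # per-group sum of y_i squared
    for g, v in sample:
        s1[g] += v
        s2[g] += v * v
    out = {}
    for g in groups:
        mean = s1[g] / n
        var = max(0.0, (s2[g] - n * mean * mean) / (n - 1))
        se = pop_size * (fpc * var / n) ** 0.5
        est = pop_size * mean
        out[g] = (est - z * se, est + z * se)
    return out


def simulate(num_groups=10, pop_size=5000, sample_size=500,
             trials=300, alpha=0.05, seed=0):
    """Estimate how often ALL group intervals cover the truth at once."""
    rng = random.Random(seed)
    groups = list(range(num_groups))
    # Synthetic population (an assumption for illustration only).
    population = [(rng.randrange(num_groups), rng.uniform(0, 100))
                  for _ in range(pop_size)]
    truth = {g: 0.0 for g in groups}
    for g, v in population:
        truth[g] += v

    # Critical values: one for separate per-group intervals, one widened
    # by a Bonferroni correction across the groups.
    z_single = NormalDist().inv_cdf(1 - alpha / 2)
    z_bonf = NormalDist().inv_cdf(1 - alpha / (2 * num_groups))

    all_covered = {"per_group": 0, "bonferroni": 0}
    for _ in range(trials):
        # One sample serves every group, as in the GROUP BY setting.
        sample = rng.sample(population, sample_size)
        for name, z in (("per_group", z_single), ("bonferroni", z_bonf)):
            ivs = group_sum_intervals(sample, pop_size, groups, z)
            if all(lo <= truth[g] <= hi for g, (lo, hi) in ivs.items()):
                all_covered[name] += 1
    rates = {k: c / trials for k, c in all_covered.items()}
    return rates, z_single, z_bonf


if __name__ == "__main__":
    rates, z_single, z_bonf = simulate()
    print("simultaneous coverage:", rates)
    print("critical values:", round(z_single, 3), round(z_bonf, 3))
```

In this simplified setting the separate 95% intervals are each reasonable on their own, yet the chance that all of them hold together is noticeably lower than 95%. The Bonferroni intervals are wider and never cover less often, at the cost of being conservative. With joins, the dependence among group estimates is typically stronger than in this single-table sketch.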