Fast approximate computation of statistics on views

  • Authors:
  • Calisto Zuzarte;Xiaohui Yu

  • Affiliations:
  • IBM Toronto Lab, Markham, ON, Canada;University of Toronto, Toronto, ON, Canada

  • Venue:
  • Proceedings of the 2006 ACM SIGMOD international conference on Management of data
  • Year:
  • 2006

Quantified Score

Hi-index 0.00

Visualization

Abstract

Accurate estimation of the sizes of intermediate query results (cardinality estimation) is of critical importance to plan costing in query optimization. The common practice in current commercial database systems such as IBM DB2 Universal Database (DB2 UDB) is to derive the cardinality estimates from base-table statistics. However, this approach often suffers from simplifying yet unrealistic assumptions that have to be made about the underlying data (for example, different attributes are independently distributed).Ways for exploiting statistics on query expressions (or, statistics on views, or SITs) have been proposed to improve the accuracy of cardinality estimation. We propose a novel method for efficient computation of SITs for joins. In particular, we are concerned with statistics on join queries involving large fact tables and relatively small dimension tables. Rather than materializing the views, we make use of the frequency statistics that are available on the fact tables to obtain an approximate estimate of the statistics on various attributes in the join results. The dimension tables are generally much smaller than the fact table, and therefore we can afford to closely examine the dimension table, while at the same time avoid accessing the fact table. By closely examining the dimension table, we are able to capture the correlations between the attributes in the dimension table as well as the skew and domain range of the fact table join column values. This leads to reasonably accurate statistics on the join result. We prototyped this idea as a module on top of DB2 UDB, and our experience shows that employment of this technique results in a very significant speed-up in the computation of SITs, at the expense of only slight degradation in accuracy compared with the full-materialization method.