Compressed histograms with arbitrary bucket layouts for selectivity estimation

  • Authors:
  • Dennis Fuchs;Zhen He;Byung Suk Lee

  • Affiliations:
  • Tele Atlas, Lebanon, NH 03766, USA;Department of Computer Science and Computer Engineering, La Trobe University, Plenty Road, Bundoora, Vic. 3086, Australia;Department of Computer Science, University of Vermont, Burlington, VT 05405, USA

  • Venue:
  • Information Sciences: an International Journal
  • Year:
  • 2007

Quantified Score

Hi-index 0.07

Visualization

Abstract

Selectivity estimation is an important step of query optimization in a database management system, and multi-dimensional histogram techniques have proved promising for selectivity estimation. Recent multi-dimensional histogram techniques such as GenHist and STHoles use an arbitrary bucket layout. This layout has the advantage of requiring a smaller number of buckets to model tuple densities than those required by the traditional grid or recursive layouts. However, the arbitrary bucket layout brings an inherent disadvantage of requiring more memory to store each bucket location information. This diminishes the advantage of requiring fewer buckets and, therefore, has an adverse effect on the resulting selectivity estimation accuracy. To our knowledge, however, no existing histogram-based technique with arbitrary layout addresses this issue. In this paper, we introduce the idea of bucket location compression and then demonstrate its effectiveness for improving selectivity estimation accuracy by proposing the STHoles+ technique. STHoles+ extends STHoles by quantizing each coordinate of a bucket relative to the coordinate of the smallest enclosing bucket. This quantization increases the number of histogram buckets that can be stored in the histogram. Our quantization scheme allows STHoles+ to trade precision of histogram bucket locations for storing more buckets. Experimental results show that STHoles+ outperforms STHoles on various data distributions, query distributions, and other factors such as available memory size, quantization resolution, and dimensionality of the data space.