Structure choices for two-dimensional histogram construction

  • Authors: Hang T. A. Pham; Kenneth C. Sevcik
  • Affiliations: Department of Computer Science, University of Toronto (both authors)
  • Venue: CASCON '04: Proceedings of the 2004 Conference of the Centre for Advanced Studies on Collaborative Research
  • Year: 2004

Abstract

Histograms of the distributions of individual attributes are currently used in leading database management systems (e.g., IBM DB2, Oracle Database, and Microsoft SQL Server). Because attribute pairs in databases are seldom independent, however, combining the distributions of individual attributes under the attribute independence assumption often leads to poor estimates. More accurate answers can be obtained by using multi-dimensional histograms to characterize the joint distribution of two or more attributes. When moving from one-dimensional to two-dimensional histograms, several new issues relating to histogram structure arise: (1) Which attribute should take priority over the other with respect to data partitioning? (2) Into how many partitions should each dimension be split to obtain a desired number of histogram buckets? (3) How many of the most frequent values should be isolated and stored in singleton buckets? Using real data, we show experimentally that our proposed methods for dealing with these histogram structure choices lead to good-quality histograms for a variety of histogram partitioning techniques and various types of data distributions.
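The gap between the attribute independence assumption and a joint histogram can be illustrated with a small sketch. This is not the authors' method: it assumes a plain equi-width grid, a synthetic table in which the two attributes are perfectly correlated, and a query whose range is aligned with bucket boundaries. All function names (`bucket`, `build_2d_histogram`, `selectivity_independent`, `selectivity_joint`) are hypothetical and chosen only for this illustration.

```python
from collections import Counter

def bucket(v, lo, hi, parts):
    """Index of the equi-width partition of [lo, hi) that value v falls in."""
    width = (hi - lo) / parts
    return min(int((v - lo) / width), parts - 1)

def build_2d_histogram(rows, lo, hi, px, py):
    """Count rows per (x-bucket, y-bucket) cell of a px-by-py grid."""
    return Counter(
        (bucket(x, lo, hi, px), bucket(y, lo, hi, py)) for x, y in rows
    )

def selectivity_independent(rows, lo, hi, px, py, bx, by):
    """Estimate from two one-dimensional marginals, assuming independence."""
    n = len(rows)
    fx = sum(1 for x, _ in rows if bucket(x, lo, hi, px) < bx) / n
    fy = sum(1 for _, y in rows if bucket(y, lo, hi, py) < by) / n
    return fx * fy

def selectivity_joint(hist, n, bx, by):
    """Estimate from the two-dimensional histogram's joint cell counts."""
    return sum(c for (i, j), c in hist.items() if i < bx and j < by) / n

# Assumed toy data: 100 rows where attribute y always equals attribute x.
rows = [(v, v) for v in range(100)]
hist = build_2d_histogram(rows, 0, 100, 2, 2)

# Query: x < 50 AND y < 50, i.e. the first bucket in each dimension.
independent = selectivity_independent(rows, 0, 100, 2, 2, 1, 1)
joint = selectivity_joint(hist, len(rows), 1, 1)
print(independent, joint)
```

Half of the rows satisfy the query. On this toy table the independence-based estimate is a quarter of the rows, while even a 2-by-2 joint histogram recovers the true fraction because it records that all rows fall in the two diagonal cells. Real data is rarely this extreme, and the sketch leaves open the three structure questions listed in the abstract.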