Concise and accurate data summaries for fast approximate query answering

  • Authors:
  • Kenneth C. Sevcik;Hai Wang

  • Affiliations:
  • -;-

  • Venue:
  • Concise and accurate data summaries for fast approximate query answering
  • Year:
  • 2004

Quantified Score

Hi-index 0.00

Visualization

Abstract

Many techniques have been proposed to support fast approximate query answering using summarized information of the data. Among them, histogram techniques and wavelet techniques are two popular types that have been extensively studied. In this thesis, we investigate the trade-off between the space used and the accuracy of various histogram and wavelet techniques. We also examine their construction costs and query answering time. The major contributions of this thesis are as follows. First, we present a general model for fast approximate query answering in many database applications. This model unifies different scenarios so that histogram and wavelet techniques can be systematically evaluated and compared. Second, we present a thorough experimental evaluation of previously proposed histogram and wavelet techniques. Third, we present a new family of histograms, the Hierarchical Model Fitting (HMF) histograms, based on the Minimum Description Length (MDL) principle, which has been widely used for model selection in statistics and machine learning. The one-dimensional HMF histogram is applicable to one-dimensional data, and the multi-dimensional HMF histogram is applicable to multi-dimensional data. The HMF histograms can be constructed to either seek the highest possible accuracy within a given space budget, or seek the most concise representation that leads to accuracy within a specified tolerance. We show that the HMF histograms are capable of providing more accurate approximations than previously proposed techniques for many real and synthetic data sets across a variety of query workloads. Fourth, using Information Theory, we quantitatively assess the information gain due to each of the different types of histogram information both individually and in combination. Based on theoretical and experimental evidence, we suggest effective heuristics for allocating space to utilize different types of histogram information. We also present a new type of multi-dimensional histogram, called the multi-dimensional Values & Intervals (VI) histogram, that can be constructed in just one scan through the data. All other types of multi-dimensional histograms require much larger construction costs than the multi-dimensional VI histogram, and they are seldom used in practice due to their high construction costs. Through a set of experiments, we show that the multi-dimensional VI histogram is capable of providing more accurate approximations than the techniques currently used in major commercial database management systems, including IBM DB2, Oracle Database, and Microsoft SQL Server, with similar construction time. Finally, we identify the characteristics of the data for which wavelet techniques perform poorly or excellently. We present an algorithm, called the Majorization Ranking Test (MRT) algorithm, to quickly determine which wavelet technique to use for fast approximate query answering (if any). The MRT algorithm also allows us to decide whether to use wavelet techniques or histogram techniques. We also present a new family of wavelet techniques, the Space Efficient Wavelet (SEW) techniques, which improve on previously proposed wavelet techniques by utilizing space in a more efficient way. We show that the SEW techniques dominate previously proposed wavelet techniques in both one-dimensional and multi-dimensional cases.