Time-Interval Sampling for Improved Estimations in Data Warehouses

Authors:
Pedro Furtado;João Pedro Costa
Affiliations:
-;-
Venue:
DaWaK 2000 Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery
Year:
2002

Citing 8
Cited 2

Random sampling with a reservoir

ACM Transactions on Mathematical Software (TOMS)
Online aggregation

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
New sampling-based summary statistics for improving approximate query answers

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Join synopses for approximate query answering

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Dynamic generation of discrete random variates

SODA '93 Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms
Approximate data structures with applications

SODA '94 Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms
Congressional samples for approximate answering of group-by queries

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Large-Sample and Deterministic Confidence Intervals for Online Aggregation

SSDBM '97 Proceedings of the Ninth International Conference on Scientific and Statistical Database Management

DSQoS-distributed architecture providing QoS in summary warehouses

DOLAP '03 Proceedings of the 6th ACM international workshop on Data warehousing and OLAP
Estimating the overlapping area of polygon join

SSTD'05 Proceedings of the 9th international conference on Advances in Spatial and Temporal Databases

Quantified Score

Hi-index	0.00

Visualization

Abstract

In large data warehouses it is possible to return very fast approximate answers to user queries using pre-computed sampling summaries well-fit for all types of exploration analysis. However, their usage is constrained by the fact that there must be a representative number of samples in grouping intervals to yield acceptable accuracy. In this paper we propose and evaluate a technique that deals with the representation issue by using time interval-biased stratified samples (TISS). The technique is able to deliver fast accurate analysis to the user by taking advantage of the importance of the time dimension in most user analysis. It is designed as a transparent middle layer, which analyzes and rewrites the query to use a summary instead of the base data warehouse. The estimations and error bounds returned using the technique are compared to those of traditional sampling summaries, to show that it achieves significant improvement in accuracy.