Improving estimation accuracy of aggregate queries on data cubes

Authors:
E. Pourabbas;A. Shoshani
Affiliations:
Italian National Research Council, Istituto di Analisi dei Sistemi ed Informatica "Antonio Ruberti", Viale Manzoni 30, 00185 Rome, Italy;Lawrence Berkeley National Laboratory, Mailstop 50B-3238, 1 Cyclotron Road, Berkeley, CA 94720, USA
Venue:
Data & Knowledge Engineering
Year:
2010

Abstract

In this paper, we investigate the problem of estimating a target database from summary databases derived from a base data cube. We show that such estimates can be derived by choosing a primary database that has the desired target measure but not the desired dimensions, and using a proxy database to estimate the results. This technique is common in statistics, but an important issue we address is the accuracy of these estimates. Specifically, given multiple primary and multiple proxy databases, the problem is how to select the primary and proxy databases that will generate the most accurate estimate of the target database. We propose an algorithmic approach that uses the principles of information entropy to determine the steps for selecting or computing the primary and proxy databases that provide the most accurate target database. We show that the primary database with the largest number of cells in common with the target database and the proxy database provides the most accurate estimates, and we prove that this is consistent with maximizing the entropy. We provide experimental results on the accuracy of the target database estimation to verify our results. Furthermore, we investigate the accuracy results in cases where the dimensions are defined over a hierarchy of categories and roll-up and drill-down operations are needed to generate the desired target results.
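
The primary/proxy idea in the abstract can be illustrated with a minimal sketch. It assumes simple proportional allocation: the primary database holds the target measure over one dimension, the proxy database holds some other measure over that dimension plus the missing one, and each primary total is split according to the proxy's shares. The abstract does not state the exact estimator, so this allocation rule, the function names `estimate_target` and `entropy`, and the sample data are all illustrative assumptions, not the authors' procedure. The `entropy` helper is plain Shannon entropy over normalized cell values, included only to show the kind of quantity an entropy-based comparison would work with.

```python
import math


def estimate_target(primary, proxy):
    """Estimate target cells by proportional allocation (an assumed rule).

    primary: dict mapping a -> total of the target measure for category a.
    proxy:   dict mapping (a, b) -> proxy measure for the cell (a, b).
    Returns a dict mapping (a, b) -> estimated target measure.
    """
    # Sum the proxy measure over b for each category a.
    proxy_totals = {}
    for (a, _b), value in proxy.items():
        proxy_totals[a] = proxy_totals.get(a, 0.0) + value

    # Split each primary total according to the proxy's share of the cell.
    target = {}
    for (a, b), value in proxy.items():
        total = proxy_totals[a]
        share = value / total if total else 0.0
        target[(a, b)] = primary.get(a, 0.0) * share
    return target


def entropy(cells):
    """Shannon entropy (in bits) of the cell values, normalized to sum to 1."""
    total = sum(cells.values())
    if total <= 0:
        return 0.0
    h = 0.0
    for value in cells.values():
        if value > 0:
            p = value / total
            h -= p * math.log2(p)
    return h


# Hypothetical data: the target measure is known by state only (primary),
# while a different measure is known by state and area type (proxy).
primary = {"CA": 100.0, "NV": 40.0}
proxy = {
    ("CA", "urban"): 30.0,
    ("CA", "rural"): 10.0,
    ("NV", "urban"): 5.0,
    ("NV", "rural"): 5.0,
}

estimate = estimate_target(primary, proxy)
for cell in sorted(estimate):
    print(cell, round(estimate[cell], 2))
print("entropy of estimate (bits):", round(entropy(estimate), 4))
```

Under this assumed rule the estimated cells for each state add back up to the primary total for that state, so the primary database's marginals are preserved. Choosing among several candidate primary and proxy databases would mean repeating this step per candidate pair and comparing the results, which is the selection problem the abstract addresses.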