There is growing interest in analyzing the data held in warehouses. Data warehouses can be extremely large, and typical queries frequently take too long to answer. Manageable, portable summaries make interactive response times possible in exploratory data analysis. Simple data reduction techniques aim to give the best estimates for smaller response times and storage needs, but they usually produce coarse approximations. Because the user is exposed to the approximation returned, it is important to determine which queries would not be approximated satisfactorily; in those cases either the base data is accessed (if available) or the user is warned. In this paper, the accuracy of approximations is determined experimentally for simple data reduction algorithms and several data sets. We show that data cube density and distribution skew are important parameters, and that large range queries are approximated much more accurately than point or small range queries. We quantify this and other results that should be taken into consideration when incorporating data reduction techniques into the design.
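The contrast between point and range queries can be illustrated with a minimal sketch. The abstract does not name the reduction algorithms that were evaluated, so the code below assumes a stand-in: a one-dimensional array of cells summarized by fixed-width bucket averages. The function names, the synthetic Pareto-distributed data, and the parameter values (density 0.3, bucket size 16) are all hypothetical choices made for illustration, not taken from the paper.

```python
import random


def reduce_to_buckets(cells, bucket_size):
    """Replace each run of `bucket_size` cells with its average value."""
    summary = []
    for start in range(0, len(cells), bucket_size):
        bucket = cells[start:start + bucket_size]
        summary.append(sum(bucket) / len(bucket))
    return summary


def exact_range_sum(cells, lo, hi):
    """Exact SUM over cells[lo:hi], computed from the base data."""
    return sum(cells[lo:hi])


def approx_range_sum(summary, bucket_size, lo, hi):
    """Estimate the same SUM, treating every cell as its bucket average."""
    return sum(summary[i // bucket_size] for i in range(lo, hi))


def mean_relative_error(cells, summary, bucket_size, width, trials, rng):
    """Average relative error over random range queries of a given width."""
    errors = []
    for _ in range(trials):
        lo = rng.randrange(len(cells) - width + 1)
        exact = exact_range_sum(cells, lo, lo + width)
        if exact == 0:
            continue  # relative error is undefined for an empty answer
        approx = approx_range_sum(summary, bucket_size, lo, lo + width)
        errors.append(abs(approx - exact) / exact)
    return sum(errors) / len(errors) if errors else 0.0


def make_cells(n, density, rng):
    """Assumed synthetic data: sparse cells with skewed (Pareto) values."""
    return [int(rng.paretovariate(1.5)) if rng.random() < density else 0
            for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    cells = make_cells(1024, density=0.3, rng=rng)
    summary = reduce_to_buckets(cells, bucket_size=16)
    # Width 1 is a point query; larger widths are range queries.
    for width in (1, 16, 256):
        err = mean_relative_error(cells, summary, 16, width, 500, rng)
        print(f"range width {width:4d}: mean relative error {err:.3f}")
```

In this sketch a range that lines up with bucket boundaries is reproduced exactly, so the error comes only from the partially covered buckets at the two ends of the range. That fixed-size error is spread over a larger exact answer as the range widens, which is one plausible reading of why large range queries fare better than point queries.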