Enhancing histograms by tree-like bucket indices

Authors:
Francesco Buccafurri;Gianluca Lax;Domenico Saccà;Luigi Pontieri;Domenico Rosaci
Affiliations:
DIMET Department, University "Mediterranea" of Reggio Calabria, Reggio Calabria, Italy;DIMET Department, University "Mediterranea" of Reggio Calabria, Reggio Calabria, Italy;DEIS Department, University of Calabria and ICAR-CNR, Rende, Italy;DEIS Department, University of Calabria and ICAR-CNR, Rende, Italy;DIMET Department, University "Mediterranea" of Reggio Calabria, Reggio Calabria, Italy
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2008

Abstract

Histograms are used to summarize the contents of relations into a number of buckets for the estimation of query result sizes. Several techniques have been proposed in the past for determining bucket boundaries that provide accurate estimations. However, while the search strategies for optimal bucket boundaries are rather sophisticated, little attention has been paid to estimating queries inside buckets, and all of the above techniques adopt naive methods for such an estimation. This paper focuses on the problem of improving the estimation inside a bucket once its boundaries have been fixed. The proposed technique adds to each bucket one memory word of additional information, organized into a tree-like index that stores approximate cumulative frequencies in a hierarchical fashion. Both theoretical analysis and experimental results show that the proposed approach improves the accuracy of the estimation inside buckets with respect to both classical approaches (such as the continuous value assumption and the uniform spread assumption) and a number of alternative ways of organizing the additional information. The index is then added to state-of-the-art histograms, with the non-obvious result that, despite the space overhead, which reduces the number of buckets allowed within a fixed storage budget, the original methods are strongly improved in terms of accuracy.
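
The idea of a per-bucket hierarchical index can be illustrated with a minimal sketch. It is not the encoding used in the paper: the function names, the halving scheme, and the choice of the same number of bits at every level are assumptions made here for illustration only. The sketch compares the continuous value assumption, which spreads the bucket total uniformly over the bucket, with an index that stores, for each segment at each level, a quantized ratio between the sum of its left half and the sum of the whole segment.

```python
from typing import List


def cva_estimate(total: float, width: int, lo: float, hi: float) -> float:
    """Continuous value assumption: the bucket total is spread uniformly.

    [lo, hi) is a range expressed in bucket-relative positions.
    """
    return total * (hi - lo) / width


def build_index(freqs: List[int], levels: int, bits: int) -> List[List[int]]:
    """Build a hypothetical tree-like index for one bucket.

    Assumes len(freqs) is a multiple of 2**levels. Each level halves every
    segment and stores round(left_sum / segment_sum * (2**bits - 1)).
    """
    scale = (1 << bits) - 1
    index = []
    segments = [(0, len(freqs))]
    for _ in range(levels):
        codes, next_segments = [], []
        for start, end in segments:
            mid = (start + end) // 2
            parent = sum(freqs[start:end])
            left = sum(freqs[start:mid])
            codes.append(round(left / parent * scale) if parent else 0)
            next_segments += [(start, mid), (mid, end)]
        index.append(codes)
        segments = next_segments
    return index


def leaf_sums(total: float, index: List[List[int]], bits: int) -> List[float]:
    """Reconstruct approximate sums of the finest segments from the index."""
    scale = (1 << bits) - 1
    sums = [float(total)]
    for codes in index:
        refined = []
        for parent, code in zip(sums, codes):
            left = parent * code / scale
            refined += [left, parent - left]
        sums = refined
    return sums


def index_estimate(total: float, width: int, index: List[List[int]],
                   bits: int, lo: float, hi: float) -> float:
    """Estimate the sum over [lo, hi) using the index.

    The continuous value assumption is applied only inside each finest segment.
    """
    sums = leaf_sums(total, index, bits)
    seg = width / len(sums)
    estimate = 0.0
    for i, s in enumerate(sums):
        a, b = i * seg, (i + 1) * seg
        overlap = max(0.0, min(hi, b) - max(lo, a))
        estimate += s * overlap / seg
    return estimate


# Made-up skewed frequencies inside a single bucket of width 8.
freqs = [40, 30, 10, 5, 5, 4, 3, 3]
total = sum(freqs)
index = build_index(freqs, levels=2, bits=6)   # 3 codes of 6 bits each
exact = sum(freqs[0:2])
plain = cva_estimate(total, len(freqs), 0, 2)
indexed = index_estimate(total, len(freqs), index, 6, 0, 2)
```

In this sketch the index occupies 18 bits, so it fits within a single 32-bit word. On skewed data such as the made-up frequencies above, the indexed estimate for the range covering the first two values is much closer to the exact sum than the uniform estimate, which is the kind of in-bucket improvement the abstract describes.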