Selectivity estimators for multidimensional range queries over real attributes

Authors:
Dimitrios Gunopulos; George Kollios; Vassilis J. Tsotras; Carlotta Domeniconi
Affiliations:
Department of Computer Science and Engineering, Bourns College of Engineering, University of California, Riverside, USA; Department of Computer Science, Boston University, USA; Department of Computer Science and Engineering, Bourns College of Engineering, University of California, Riverside, USA; Department of Information and Software Engineering, George Mason University, USA
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2005

Abstract

Estimating the selectivity of multidimensional range queries over real-valued attributes has significant applications in data exploration and database query optimization. In this paper, we consider the following problem: given a table of d attributes whose domain is the real numbers and a query that specifies a range in each dimension, find a good approximation of the number of records in the table that satisfy the query. The simplest approach is to assume that the attributes are independent. More accurate estimators try to capture the joint data distribution of the attributes. In databases, such estimators include multidimensional histograms, random sampling, and the wavelet transform. In statistics, kernel estimation techniques are used. Many traditional approaches assume that attribute values come from discrete, finite domains, where individual values have high frequencies. However, in many novel applications (such as temporal, spatial, and multimedia databases), attribute values come from the infinite domain of real numbers. Consequently, each value appears very infrequently, a characteristic that affects the behavior and effectiveness of the estimator. Moreover, real-life data exhibit attribute correlations that also affect the estimator. We present a new histogram technique designed to approximate the density of multidimensional datasets with real attributes. Our technique defines buckets of variable size and allows the buckets to overlap. The size of the cells is based on the local density of the data. The use of overlapping buckets allows a more compact approximation of the data distribution. We also show how to generalize kernel density estimators and how to apply them to the multidimensional query approximation problem. Finally, we compare the accuracy of the proposed techniques with existing techniques using real and synthetic datasets. The experimental results show that the proposed techniques are more accurate than previous approaches in high dimensions.