A bi-level Bernoulli scheme for database sampling

Authors:
Peter J. Haas; Christian König
Affiliations:
IBM Almaden Research Center; Berufsakademie Stuttgart/IBM Germany
Venue:
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Year:
2004


Abstract

Current database sampling methods give the user insufficient control when processing ISO-style sampling queries. To address this problem, we provide a bi-level Bernoulli sampling scheme that combines the row-level and page-level sampling methods currently used in most commercial systems. By adjusting the parameters of the method, the user can systematically trade off processing speed and statistical precision; the appropriate choice of parameter settings becomes a query optimization problem. We indicate the SQL extensions needed to support bi-level sampling and determine the optimal parameter settings for an important class of sampling queries with explicit time or accuracy constraints. As might be expected, row-level sampling is preferable when data values on each page are homogeneous, whereas page-level sampling should be used when data values on a page vary widely. Perhaps surprisingly, we show that in many cases the optimal sampling policy is of the "bang-bang" type: we identify a "page-heterogeneity index" (PHI) such that optimal sampling is as "row-like" as possible if the PHI is less than 1 and as "page-like" as possible otherwise. The PHI depends upon both the query and the data, and can be estimated by means of a pilot sample. Because pilot sampling can be nontrivial to implement in commercial database systems, we also give a heuristic method for setting the sampling parameters; the method avoids pilot sampling by using a small number of summary statistics that are maintained in the system catalog. Results from over 1100 experiments on 372 real and synthetic data sets show that the heuristic method performs optimally about half of the time, and yields sampling errors within a factor of 2.2 of optimal about 93% of the time. The heuristic method is stable over a wide range of sampling rates and performs best in the most critical cases, where the data is highly clustered or skewed.
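
The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of one way to combine page-level and row-level Bernoulli sampling. It assumes each page is included independently with probability `page_rate` and, within an included page, each row is kept independently with probability `row_rate`, so that a row survives with overall probability `page_rate * row_rate`. The function names, the parameter names, and the list-of-lists stand-in for disk pages are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Sequence


def bilevel_bernoulli_sample(
    pages: Sequence[Sequence[float]],
    page_rate: float,
    row_rate: float,
    rng: random.Random,
) -> List[float]:
    """Keep each page with probability page_rate, then each row of a
    kept page with probability row_rate (both rates are assumptions)."""
    if not (0.0 < page_rate <= 1.0 and 0.0 < row_rate <= 1.0):
        raise ValueError("sampling rates must lie in (0, 1]")
    sample: List[float] = []
    for page in pages:
        # Page-level coin flip: a rejected page is never read at all.
        if rng.random() < page_rate:
            for row in page:
                # Row-level coin flip inside an accepted page.
                if rng.random() < row_rate:
                    sample.append(row)
    return sample


def estimate_sum(sample: Sequence[float], page_rate: float, row_rate: float) -> float:
    """Scale the sample sum by the inverse inclusion probability of a row."""
    return sum(sample) / (page_rate * row_rate)


if __name__ == "__main__":
    # Ten hypothetical pages of ten rows each, holding the values 1..100.
    table = [[10 * j + i for i in range(1, 11)] for j in range(10)]
    rng = random.Random(42)
    drawn = bilevel_bernoulli_sample(table, page_rate=0.5, row_rate=0.5, rng=rng)
    print(len(drawn), estimate_sum(drawn, 0.5, 0.5))
```

Under these assumptions, setting `row_rate = 1.0` reduces the sketch to pure page-level sampling and setting `page_rate = 1.0` reduces it to pure row-level sampling, which mirrors the "page-like" and "row-like" extremes the abstract describes. A lower `page_rate` means fewer pages are read, while the precision of the estimate depends on how values vary within and across pages.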