Current database sampling methods give the user insufficient control when processing ISO-style sampling queries. To address this problem, we provide a bi-level Bernoulli sampling scheme that combines the row-level and page-level sampling methods currently used in most commercial systems. By adjusting the parameters of the method, the user can systematically trade off processing speed against statistical precision; the appropriate choice of parameter settings becomes a query optimization problem. We indicate the SQL extensions needed to support bi-level sampling and determine the optimal parameter settings for an important class of sampling queries with explicit time or accuracy constraints. As might be expected, row-level sampling is preferable when data values on each page are homogeneous, whereas page-level sampling should be used when data values on a page vary widely. Perhaps surprisingly, we show that in many cases the optimal sampling policy is of the "bang-bang" type: we identify a "page-heterogeneity index" (PHI) such that optimal sampling is as "row-like" as possible if the PHI is less than 1 and as "page-like" as possible otherwise. The PHI depends upon both the query and the data, and can be estimated by means of a pilot sample. Because pilot sampling can be nontrivial to implement in commercial database systems, we also give a heuristic method for setting the sampling parameters; the method avoids pilot sampling by using a small number of summary statistics that are maintained in the system catalog. Results from over 1100 experiments on 372 real and synthetic data sets show that the heuristic method performs optimally about half of the time, and yields sampling errors within a factor of 2.2 of optimal about 93% of the time. The heuristic method is stable over a wide range of sampling rates and performs best in the most critical cases, where the data is highly clustered or skewed.
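The core ideas above can be illustrated with a small sketch, assuming a table stored as a list of pages (each page a list of rows). This is not the paper's implementation; the function names, the page/row representation, and the reduction of the bang-bang rule to two extreme settings are illustrative assumptions. Each page is included with probability `page_rate`, and each row on an included page with probability `row_rate`, so a row's overall inclusion probability is `page_rate * row_rate`.

```python
import random

def bilevel_bernoulli_sample(pages, page_rate, row_rate, seed=None):
    """Bi-level Bernoulli sampling (sketch): include each page with
    probability page_rate, then each row on an included page with
    probability row_rate. Overall per-row inclusion probability is
    page_rate * row_rate."""
    rng = random.Random(seed)
    sample = []
    for page in pages:
        if rng.random() < page_rate:          # page-level Bernoulli trial
            for row in page:
                if rng.random() < row_rate:   # row-level Bernoulli trial
                    sample.append(row)
    return sample

def choose_rates(target_rate, phi):
    """Illustration of the 'bang-bang' rule: for a desired overall
    sampling rate target_rate, be as row-like as possible (scan every
    page, thin rows within pages) when the page-heterogeneity index
    PHI < 1, and as page-like as possible (take whole pages) otherwise."""
    if phi < 1:
        return 1.0, target_rate   # row-like: page_rate = 1, row_rate = f
    return target_rate, 1.0       # page-like: page_rate = f, row_rate = 1
```

A Horvitz-Thompson-style estimate of, say, a SUM would then scale each sampled value by `1 / (page_rate * row_rate)`; the speed/precision trade-off comes from the fact that page-like settings touch fewer pages per sampled row.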