Random sampling for histogram construction: how much is enough?

Authors:
Surajit Chaudhuri;Rajeev Motwani;Vivek Narasayya
Affiliations:
Microsoft Research;Stanford University;Microsoft Research
Venue:
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Year:
1998

Abstract

Random sampling is a standard technique for constructing (approximate) histograms for query optimization. However, any real implementation in commercial products requires solving the hard problem of determining “How much sampling is enough?” We address this critical question in the context of equi-height histograms used in many commercial products, including Microsoft SQL Server. We introduce a conservative error metric capturing the intuition that for an approximate histogram to have low error, the error must be small in all regions of the histogram. We then present a result establishing an optimal bound on the amount of sampling required for pre-specified error bounds. We also describe an adaptive page sampling algorithm which achieves greater efficiency by using all values in a sampled page but adjusts the amount of sampling depending on clustering of values in pages. Next, we establish that the problem of estimating the number of distinct values is provably difficult, but propose a new error metric which has a reliable estimator and can still be exploited by query optimizers to influence the choice of execution plans. The algorithm for histogram construction was prototyped on Microsoft SQL Server 7.0 and we present experimental results showing that the adaptive algorithm accurately approximates the true histogram over different data distributions.
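
The trade-off the abstract describes can be illustrated with a minimal sketch: build an equi-height histogram from a uniform random sample, then check how evenly its bucket boundaries split the full data. This is not the paper's algorithm. The function names, the row-level (rather than page-level) sampling, and the error measure used here (largest relative deviation of any bucket from the ideal size n/k) are all assumptions chosen for illustration; the abstract does not define its error metric or its adaptive page sampling procedure.

```python
import random
from bisect import bisect_right


def equi_height_separators(values, k):
    """Return k-1 separators that split the sorted values into k near-equal buckets."""
    s = sorted(values)
    n = len(s)
    return [s[(i * n) // k] for i in range(1, k)]


def bucket_counts(values, separators):
    """Count how many values land in each bucket defined by the separators."""
    counts = [0] * (len(separators) + 1)
    for v in values:
        # A value equal to a separator is assigned to the bucket on its right.
        counts[bisect_right(separators, v)] += 1
    return counts


def max_bucket_error(values, separators):
    """Hypothetical error measure: worst relative deviation from the ideal bucket size."""
    counts = bucket_counts(values, separators)
    ideal = len(values) / len(counts)
    return max(abs(c - ideal) / ideal for c in counts)


def sampled_histogram(values, k, sample_size, rng):
    """Build the separators from a uniform random sample drawn without replacement."""
    sample = rng.sample(values, sample_size)
    return equi_height_separators(sample, k)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(0.0, 1.0) for _ in range(20000)]
    for sample_size in (200, 2000, 20000):
        seps = sampled_histogram(data, 10, sample_size, rng)
        print(sample_size, round(max_bucket_error(data, seps), 3))
```

Taking the maximum over buckets, rather than an average, is one way to express the abstract's requirement that the error be small in every region of the histogram. Running the loop with increasing sample sizes gives a rough empirical sense of how much sampling a given error target needs on a particular data distribution.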