Effective use of block-level sampling in statistics estimation

Authors:
Surajit Chaudhuri;Gautam Das;Utkarsh Srivastava
Affiliations:
Microsoft Research;Microsoft Research;Stanford University
Venue:
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Year:
2004

Citing 10
Cited 26

Practical selectivity estimation through adaptive sampling

SIGMOD '90 Proceedings of the 1990 ACM SIGMOD international conference on Management of data
Error-constrained COUNT query evaluation in relational databases

SIGMOD '91 Proceedings of the 1991 ACM SIGMOD international conference on Management of data
Sequential sampling procedures for query size estimation

SIGMOD '92 Proceedings of the 1992 ACM SIGMOD international conference on Management of data
Improved histograms for selectivity estimation of range predicates

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Random sampling for histogram construction: how much is enough?

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Statistical estimators for relational algebra expressions

Proceedings of the seventh ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Towards estimation error guarantees for distinct values

PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Accurate estimation of the number of tuples satisfying a condition

SIGMOD '84 Proceedings of the 1984 ACM SIGMOD international conference on Management of data
Sampling-Based Estimation of the Number of Distinct Values of an Attribute

VLDB '95 Proceedings of the 21th International Conference on Very Large Data Bases
Fast Incremental Maintenance of Approximate Histograms

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases

Online Random Shuffling of Large Database Tables

IEEE Transactions on Knowledge and Data Engineering
Random Sampling for Continuous Streams with Arbitrary Updates

IEEE Transactions on Knowledge and Data Engineering
A random walk approach to sampling hidden databases

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Efficient Approximate Query Processing in Peer-to-Peer Networks

IEEE Transactions on Knowledge and Data Engineering
Self-tuning database systems: a decade of progress

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Pay-as-you-go user feedback for dataspace systems

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Distinct value estimation on peer-to-peer networks

Proceedings of the 1st international conference on PErvasive Technologies Related to Assistive Environments
Tagmark: reliable estimations of RFID tags for business processes

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Hashed samples: selectivity estimators for set similarity selection queries

Proceedings of the VLDB Endowment
Sample synopses for approximate answering of group-by queries

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Tuning database configuration parameters with iTuned

Proceedings of the VLDB Endowment
Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling

Data & Knowledge Engineering
Online monitoring and visualisation of database structural deterioration

International Journal of Autonomic Computing
Sampling dirty data for matching attributes

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Result-size estimation for information-retrieval subqueries

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Just-in-time analytics on large file systems

FAST'11 Proceedings of the 9th USENIX conference on File and stroage technologies
HASE: a hybrid approach to selectivity estimation for conjunctive predicates

EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
Early accurate results for advanced analytics on MapReduce

Proceedings of the VLDB Endowment
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches

Foundations and Trends in Databases
HEDC: a histogram estimator for data in the cloud

Proceedings of the fourth international workshop on Cloud data management
Balancing reducer skew in MapReduce workloads using progressive sampling

Proceedings of the Third ACM Symposium on Cloud Computing
You can stop early with COLA: online processing of aggregate queries in the cloud

Proceedings of the 21st ACM international conference on Information and knowledge management
Towards realistic sampling: generating dependencies in a relational database

Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication
Indexing for summary queries: Theory and practice

ACM Transactions on Database Systems (TODS)
Can we analyze big data inside a DBMS?

Proceedings of the sixteenth international workshop on Data warehousing and OLAP
Scalable progressive analytics on big data in the cloud

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

Block-level sampling is far more efficient than true uniform-random sampling over a large database, but prone to significant errors if used to create database statistics. In this paper, we develop principled approaches to overcome this limitation of block-level sampling for histograms as well as distinct-value estimations. For histogram construction, we give a novel two-phase adaptive method in which the sample size required to reach a desired accuracy is decided based on a first phase sample. This method is significantly faster than previous iterative methods proposed for the same problem. For distinct-value estimation, we show that existing estimators designed for uniform-random samples may perform very poorly if used directly on block-level samples. We present a key technique that computes an appropriate subset of a block-level sample that is suitable for use with most existing estimators. This, to the best of our knowledge, is the first principled method for distinct-value estimation with block-level samples. We provide extensive experimental results validating our methods.