A dip in the reservoir: maintaining sample synopses of evolving datasets

Authors:
Rainer Gemulla;Wolfgang Lehner;Peter J. Haas
Affiliations:
Technische Universität Dresden, Dresden, Germany;Technische Universität Dresden, Dresden, Germany;IBM Almaden Research Center, San Jose, California
Venue:
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Year:
2006

Citing 20
Cited 10

Faster methods for random sampling

Communications of the ACM
Random sampling with a reservoir

ACM Transactions on Mathematical Software (TOMS)
The power of sampling in knowledge discovery

PODS '94 Proceedings of the thirteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Improved histograms for selectivity estimation of range predicates

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
The art of computer programming, volume 2 (3rd ed.): seminumerical algorithms

The art of computer programming, volume 2 (3rd ed.): seminumerical algorithms
New sampling-based summary statistics for improving approximate query answers

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
On random sampling over joins

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Sampling from a moving window over streaming data

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Fast incremental maintenance of approximate histograms

ACM Transactions on Database Systems (TODS)
Maintenance of Materialized Views of Sampling Queries

Proceedings of the Eighth International Conference on Data Engineering
A bi-level Bernoulli scheme for database sampling

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Online maintenance of very large random samples

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
CORDS: automatic discovery of correlations and soft functional dependencies

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Sampling in dynamic data streams and applications

SCG '05 Proceedings of the twenty-first annual symposium on Computational geometry
Summarizing and mining inverse distributions on data streams via dynamic inverse sampling

VLDB '05 Proceedings of the 31st international conference on Very large data bases
Techniques for Warehousing of Sample Data

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Load shedding in a data stream manager

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
BHUNT: automatic discovery of Fuzzy algebraic constraints in relational data

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Deferred maintenance of disk-based random samples

EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
Toward automated large-scale information integration and discovery

Data Management in a Connected World

Cardinality estimation using sample views with quality assurance

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Maintaining bernoulli samples over evolving multisets

Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Efficient and scalable statistics gathering for large databases in Oracle 11g

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Estimating the confidence of conditional functional dependencies

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
The VC-dimension of SQL queries and selectivity estimation through sampling

ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part II
A simple, yet effective and efficient, sliding window sampling algorithm

DASFAA'10 Proceedings of the 15th international conference on Database Systems for Advanced Applications - Volume Part I
Don't let the negatives bring you down: sampling from streams of signed updates

Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches

Foundations and Trends in Databases
Efficient sampling of non-strict turnstile data streams

FCT'13 Proceedings of the 19th international conference on Fundamentals of Computation Theory
Adaptive stratified reservoir sampling over heterogeneous data streams

Information Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

Perhaps the most flexible synopsis of a database is a random sample of the data; such samples are widely used to speed up processing of analytic queries and data-mining tasks, enhance query optimization, and facilitate information integration. In this paper, we study methods for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions and deletions. For "stable" datasets whose size remains roughly constant over time, we provide a novel sampling scheme, called "random pairing" (RP) which maintains a bounded-size uniform sample by using newly inserted data items to compensate for previous deletions. The RP algorithm is the first extension of the almost 40-year-old reservoir sampling algorithm to handle deletions. Experiments show that, when dataset-size fluctuations over time are not too extreme, RP is the algorithm of choice with respect to speed and sample-size stability. For "growing" datasets, we consider algorithms for periodically "resizing" a bounded-size random sample upwards. We prove that any such algorithm cannot avoid accessing the base data, and provide a novel resizing algorithm that minimizes the time needed to increase the sample size.