Deferred maintenance of disk-based random samples

Authors:
Rainer Gemulla;Wolfgang Lehner
Affiliations:
Dresden University of Technology, Dresden, Germany;Dresden University of Technology, Dresden, Germany
Venue:
EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
Year:
2006

Citing 13
Cited 3

Faster methods for random sampling

Communications of the ACM
Random sampling with a reservoir

ACM Transactions on Mathematical Software (TOMS)
Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator

ACM Transactions on Modeling and Computer Simulation (TOMACS) - Special issue on uniform random number generation
Join synopses for approximate query answering

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Materialized views: techniques, implementations, and applications

Materialized views: techniques, implementations, and applications
Congressional samples for approximate answering of group-by queries

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Maintenance of Materialized Views of Sampling Queries

Proceedings of the Eighth International Conference on Data Engineering
Overcoming Limitations of Sampling for Aggregation Queries

Proceedings of the 17th International Conference on Data Engineering
ICICLES: Self-Tuning Samples for Approximate Query Answering

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Dynamic sample selection for approximate query processing

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
A bi-level Bernoulli scheme for database sampling

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Online maintenance of very large random samples

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Load shedding in a data stream manager

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29

Derby/S: a DBMS for sample-based query answering

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
A dip in the reservoir: maintaining sample synopses of evolving datasets

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches

Foundations and Trends in Databases

Quantified Score

Hi-index	0.00

Visualization

Abstract

Random sampling is a well-known technique for approximate processing of large datasets. We introduce a set of algorithms for incremental maintenance of large random samples on secondary storage. We show that the sample maintenance cost can be reduced by refreshing the sample in a deferred manner. We introduce a novel type of log file which follows the intuition that only a “sample” of the operations on the base data has to be considered to maintain a random sample in a statistically correct way. Additionally, we develop a deferred refresh algorithm which updates the sample by using fast sequential disk access only, and which does not require any main memory. We conducted an extensive set of experiments and found, that our algorithms reduce maintenance cost by several orders of magnitude.