Faster methods for random sampling
Communications of the ACM
Random sampling with a reservoir
ACM Transactions on Mathematical Software (TOMS)
The power of sampling in knowledge discovery
PODS '94 Proceedings of the thirteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Improved histograms for selectivity estimation of range predicates
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
The art of computer programming, volume 2 (3rd ed.): seminumerical algorithms
The art of computer programming, volume 2 (3rd ed.): seminumerical algorithms
New sampling-based summary statistics for improving approximate query answers
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Sampling from a moving window over streaming data
SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Fast incremental maintenance of approximate histograms
ACM Transactions on Database Systems (TODS)
Maintenance of Materialized Views of Sampling Queries
Proceedings of the Eighth International Conference on Data Engineering
A bi-level Bernoulli scheme for database sampling
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Online maintenance of very large random samples
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
CORDS: automatic discovery of correlations and soft functional dependencies
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Sampling in dynamic data streams and applications
SCG '05 Proceedings of the twenty-first annual symposium on Computational geometry
Summarizing and mining inverse distributions on data streams via dynamic inverse sampling
VLDB '05 Proceedings of the 31st international conference on Very large data bases
Techniques for Warehousing of Sample Data
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Load shedding in a data stream manager
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
BHUNT: automatic discovery of Fuzzy algebraic constraints in relational data
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Deferred maintenance of disk-based random samples
EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
Toward automated large-scale information integration and discovery
Data Management in a Connected World
Cardinality estimation using sample views with quality assurance
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Maintaining bernoulli samples over evolving multisets
Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Efficient and scalable statistics gathering for large databases in Oracle 11g
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Estimating the confidence of conditional functional dependencies
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
The VC-dimension of SQL queries and selectivity estimation through sampling
ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part II
A simple, yet effective and efficient, sliding window sampling algorithm
DASFAA'10 Proceedings of the 15th international conference on Database Systems for Advanced Applications - Volume Part I
Don't let the negatives bring you down: sampling from streams of signed updates
Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches
Foundations and Trends in Databases
Efficient sampling of non-strict turnstile data streams
FCT'13 Proceedings of the 19th international conference on Fundamentals of Computation Theory
Adaptive stratified reservoir sampling over heterogeneous data streams
Information Systems
Hi-index | 0.00 |
Perhaps the most flexible synopsis of a database is a random sample of the data; such samples are widely used to speed up processing of analytic queries and data-mining tasks, enhance query optimization, and facilitate information integration. In this paper, we study methods for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions and deletions. For "stable" datasets whose size remains roughly constant over time, we provide a novel sampling scheme, called "random pairing" (RP) which maintains a bounded-size uniform sample by using newly inserted data items to compensate for previous deletions. The RP algorithm is the first extension of the almost 40-year-old reservoir sampling algorithm to handle deletions. Experiments show that, when dataset-size fluctuations over time are not too extreme, RP is the algorithm of choice with respect to speed and sample-size stability. For "growing" datasets, we consider algorithms for periodically "resizing" a bounded-size random sample upwards. We prove that any such algorithm cannot avoid accessing the base data, and provide a novel resizing algorithm that minimizes the time needed to increase the sample size.