Reservoir sampling is a well-known technique for random sampling over data streams. In many streaming applications, however, an input stream may be naturally heterogeneous, i.e., composed of sub-streams whose statistical properties may vary considerably. For this class of applications, the conventional reservoir sampling technique does not guarantee that a statistically sufficient number of tuples from each sub-stream is included in the reservoir, which can degrade the statistical quality of the sample. In this paper, we deal with this heterogeneity problem by stratifying the reservoir sample among the underlying sub-streams. In particular, we consider situations in which the stratified reservoir sample is needed to obtain reliable estimates at the level of either the entire data stream or individual sub-streams. The first challenge in this stratification is to achieve an optimal allocation of a fixed-size reservoir to individual sub-streams. The second challenge is to adaptively adjust the allocation as sub-streams appear in, or disappear from, the input stream and as their statistical properties change over time. We present a stratified reservoir sampling algorithm designed to meet these challenges, and demonstrate through experiments the superior sample quality and the adaptivity of the algorithm.
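To make the contrast in the abstract concrete, the following is a minimal Python sketch, not the algorithm proposed in the paper. It shows conventional reservoir sampling (the classic Algorithm R) next to a simple stratified variant that keeps one sub-reservoir per sub-stream. The names `reservoir_sample`, `StratifiedReservoir`, and `offer` are hypothetical, and the allocation is assumed to be fixed and supplied by the caller; the optimal and adaptive allocation that the abstract describes is not reproduced here.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Conventional reservoir sampling (Algorithm R).

    Returns a uniform random sample of up to k items from a stream
    whose length is not known in advance.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


class StratifiedReservoir:
    """One sub-reservoir per sub-stream, under a fixed allocation.

    Assumption: `allocation` maps each sub-stream key to its
    sub-reservoir size and does not change over time.
    """

    def __init__(self, allocation, rng=None):
        self.allocation = dict(allocation)
        self.samples = {key: [] for key in self.allocation}
        self.seen = {key: 0 for key in self.allocation}
        self.rng = rng or random.Random()

    def offer(self, key, item):
        """Feed one tuple belonging to sub-stream `key`."""
        k = self.allocation[key]
        n = self.seen[key]  # tuples of this sub-stream seen before this one
        self.seen[key] = n + 1
        if n < k:
            self.samples[key].append(item)
        else:
            # Algorithm R applied within the sub-stream only.
            j = self.rng.randint(0, n)
            if j < k:
                self.samples[key][j] = item


# Usage: a heterogeneous stream with one dominant and one rare sub-stream.
stream = [("common", i) for i in range(10_000)] + [("rare", i) for i in range(10)]
random.Random(0).shuffle(stream)

flat = reservoir_sample(stream, 10, rng=random.Random(1))

strat = StratifiedReservoir({"common": 5, "rare": 5}, rng=random.Random(2))
for key, value in stream:
    strat.offer(key, value)
```

With the same total budget of 10 slots, the flat reservoir holds tuples from the rare sub-stream only by chance, whereas the stratified version always holds 5 of them once at least 5 have arrived. Choosing those per-sub-stream sizes well, and revising them as sub-streams come and go, is the part this sketch leaves out.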