Reservoir sampling is a well-known technique for maintaining a fixed-size random sample over a data stream of unknown size. While it suits applications that need a sample over the whole data stream, it is not designed for applications in which the input stream is composed of sub-streams with heterogeneous statistical properties. For this class of applications, conventional reservoir sampling can degrade the statistical quality of the sample because it does not guarantee that the sample includes a statistically sufficient number of tuples from each sub-stream. In this paper, we address this heterogeneity problem by stratifying the reservoir sample among the underlying sub-streams. This stratification poses two challenges. First, a fixed-size reservoir must be allocated among the individual sub-streams optimally, so that the stratified reservoir sample can be used to generate estimates at the level of either the whole data stream or the individual sub-streams. Second, the allocation must be adjusted adaptively as new sub-streams appear in the input stream, existing sub-streams disappear from it, or their statistical properties change. We propose a novel adaptive stratified reservoir sampling algorithm designed to meet these challenges. An extensive performance study shows the superiority of the achieved sample quality and demonstrates the adaptivity of the proposed sampling algorithm.
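To make the setting concrete, the following is a minimal Python sketch of conventional reservoir sampling (the classic replacement scheme often called Algorithm R) and of a naive stratified variant that keeps one reservoir per sub-stream. It is not the adaptive algorithm proposed in the paper: the names `Reservoir`, `stratified_sample`, `key`, and `per_stratum` are hypothetical, and the equal, fixed allocation per sub-stream is an assumption made purely for illustration. The paper's optimal and adaptive allocation is not reproduced here.

```python
import random


class Reservoir:
    """Fixed-size uniform sample over a stream of unknown length (Algorithm R)."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.rng = rng or random.Random()
        self.items = []   # current sample
        self.seen = 0     # number of tuples offered so far

    def offer(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Fill phase: the first `capacity` tuples are always kept.
            self.items.append(item)
            return
        # Replacement phase: keep the new tuple with probability capacity / seen.
        j = self.rng.randrange(self.seen)
        if j < self.capacity:
            self.items[j] = item


def stratified_sample(stream, key, per_stratum, seed=None):
    """Naive stratification: one reservoir of equal size per sub-stream.

    `key` maps a tuple to its sub-stream identifier. The equal allocation
    is an illustrative assumption, not the paper's allocation scheme.
    """
    rng = random.Random(seed)
    reservoirs = {}
    for item in stream:
        k = key(item)
        if k not in reservoirs:
            reservoirs[k] = Reservoir(per_stratum, rng)
        reservoirs[k].offer(item)
    return {k: r.items for k, r in reservoirs.items()}


if __name__ == "__main__":
    # A skewed stream: sub-stream "a" dominates, sub-stream "b" is rare.
    stream = [("a", i) for i in range(1000)] + [("b", i) for i in range(10)]
    sample = stratified_sample(stream, key=lambda t: t[0], per_stratum=5, seed=1)
    print({k: len(v) for k, v in sample.items()})
```

With a single reservoir over the same skewed stream, the rare sub-stream may contribute few or no tuples to the sample. Keeping a separate reservoir per sub-stream guarantees each one a fixed number of slots, at the cost of an allocation that neither reflects the sub-streams' statistical properties nor respects a single overall size bound as sub-streams come and go.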