From a high-volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, VarOpt_k, that dominates all previous schemes in terms of estimation quality. VarOpt_k provides variance-optimal unbiased estimation of subset sums. More precisely, if we have seen n items of the stream, then for any subset size m, our scheme with k samples minimizes the average variance over all subsets of size m. In fact, the optimality holds against any off-line scheme with k samples tailored to the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in O(log k) time, which is optimal even on the word RAM. Finally, it is particularly well suited for combining samples from different streams in a distributed setting.
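The core step behind a scheme of this kind can be sketched as follows: keep k items with adjusted weights, and when a (k+1)-st candidate arrives, compute a threshold tau such that the inclusion probabilities min(1, w_i/tau) sum to exactly k, drop exactly one item accordingly, and give each surviving below-threshold item the adjusted weight tau. The sketch below is a minimal illustration under these assumptions, not the authors' implementation; the class name is hypothetical, and this simple version spends O(k) time per item (a sort plus linear scans) rather than the O(log k) achieved in the paper.

```python
import random

class VarOptK:
    """Minimal sketch of threshold-based variance-optimal reservoir
    sampling: at most k items kept, each with an adjusted weight."""

    def __init__(self, k, rng=None):
        self.k = k
        self.rng = rng or random.Random()
        self.sample = []          # list of [key, adjusted_weight]

    def process(self, key, weight):
        """Feed one stream item into the reservoir."""
        self.sample.append([key, float(weight)])
        if len(self.sample) <= self.k:
            return
        # k+1 candidates: find tau with sum_i min(1, w_i/tau) = k.
        # If the j smallest items are "small" (w <= tau), then
        # (sum of their weights)/tau + (k+1-j) = k, so tau = prefix/(j-1).
        self.sample.sort(key=lambda e: e[1])
        ws = [w for _, w in self.sample]
        n = len(ws)               # n = k + 1
        prefix = ws[0]
        tau, t = 0.0, n
        for j in range(2, n + 1):
            prefix += ws[j - 1]
            cand = prefix / (j - 1)
            if ws[j - 1] <= cand and (j == n or cand <= ws[j]):
                tau, t = cand, j
                break
        # Drop exactly one item: item i is dropped with probability
        # 1 - min(1, w_i/tau); these probabilities sum to 1.
        r = self.rng.random()
        drop = t - 1              # floating-point fallback
        for i in range(t):
            p = 1.0 - ws[i] / tau
            if r < p:
                drop = i
                break
            r -= p
        del self.sample[drop]
        # Surviving small items take the adjusted weight tau, so each
        # contributes w_i in expectation: (w_i/tau) * tau = w_i.
        for e in self.sample:
            if e[1] < tau:
                e[1] = tau

    def estimate(self, predicate):
        """Unbiased estimate of the total weight of items whose key
        satisfies the given predicate."""
        return sum(w for key, w in self.sample if predicate(key))
```

One property worth noting: because the dropped probability mass is redistributed onto the surviving small items, the sum of all adjusted weights in the reservoir always equals the exact total weight seen so far; only proper subsets incur variance.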