Stream sampling for variance-optimal estimation of subset sums

Authors:
Edith Cohen;Nick Duffield;Haim Kaplan;Carsten Lund;Mikkel Thorup
Affiliations:
AT&T Labs---Research, NJ;AT&T Labs---Research, NJ;Tel Aviv University, Tel Aviv, Israel;AT&T Labs---Research, NJ;AT&T Labs---Research, NJ
Venue:
SODA '09 Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms
Year:
2009

Citing 18
Cited 8

Random sampling with a reservoir

ACM Transactions on Mathematical Software (TOMS)
The art of computer programming, volume 2 (3rd ed.): seminumerical algorithms

The art of computer programming, volume 2 (3rd ed.): seminumerical algorithms
On random sampling over joins

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Introduction to Algorithms

Introduction to Algorithms
On the relationship between file sizes, transport protocols, and self-similar network traffic

ICNP '96 Proceedings of the 1996 International Conference on Network Protocols (ICNP '96)
Gigascope: a stream database for network applications

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Distributions on Level-Sets with Applications to Approximation Algorithms

FOCS '01 Proceedings of the 42nd IEEE symposium on Foundations of Computer Science
Inside the Slammer Worm

IEEE Security and Privacy
Sampling algorithms in a stream operator

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
The DLT priority sampling is essentially optimal

Proceedings of the thirty-eighth annual ACM symposium on Theory of computing
Weighted random sampling with a reservoir

Information Processing Letters
Data streams: algorithms and applications

Foundations and Trends® in Theoretical Computer Science
Bottom-k sketches: better and more efficient estimation of aggregates

Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Summarizing data using bottom-k sketches

Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing
Equivalence between priority queues and sorting

Journal of the ACM (JACM)
Priority sampling for estimation of arbitrary subset sums

Journal of the ACM (JACM)
On the variance of subset sum estimation

ESA'07 Proceedings of the 15th annual European conference on Algorithms
Learn more, sample less: control of volume and variance in network measurement

IEEE Transactions on Information Theory

Leveraging discarded samples for tighter estimation of multiple-set aggregates

Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems
Composable, scalable, and accurate weight summarization of unaggregated data sets

Proceedings of the VLDB Endowment
Coordinated weighted sampling for estimating aggregates over multiple weight assignments

Proceedings of the VLDB Endowment
Get the most out of your sample: optimal unbiased estimators using partial information

Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Tight bounds for Lp samplers, finding duplicates in streams, and related problems

Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Structure-aware sampling on data streams

Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems
Structure-aware sampling on data streams

ACM SIGMETRICS Performance Evaluation Review - Performance evaluation review
Fair sampling across network flow measurements

Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems

Quantified Score

Hi-index	0.00

Visualization

Abstract

From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, VarOptk, that dominates all previous schemes in terms of estimation quality. VarOptk provides variance optimal unbiased estimation of subset sums. More precisely, if we have seen n items of the stream, then for any subset size m, our scheme based on k samples minimizes the average variance over all subsets of size m. In fact, the optimality is against any off-line scheme with k samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in O(log k) time, which is optimal even on the word RAM. Finally, it is particularly well suited for combination of samples from different streams in a distributed setting.