From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size $k$ that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, $\textsc{VarOpt}_k$, that dominates all previous schemes in terms of estimation quality. $\textsc{VarOpt}_k$ provides variance-optimal unbiased estimation of subset sums. More precisely, if we have seen $n$ items of the stream, then for any subset size $m$, our scheme based on $k$ samples minimizes the average variance over all subsets of size $m$. In fact, the optimality holds against any off-line scheme with $k$ samples tailored to the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in $O(\log k)$ time. Finally, it is particularly well suited for combining samples from different streams in a distributed setting.
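The update rule behind such a variance-optimal reservoir can be sketched as follows; this is a simplified Python version (function and variable names are mine, and the scan below is O(k) per item, whereas the paper's data structures achieve the stated O(log k)). When a new item brings the pool to k+1 candidates, a threshold tau is chosen so that the inclusion probabilities min(1, w/tau) sum to k; exactly one item with w < tau is dropped, item i with probability 1 - w_i/tau (these probabilities sum to 1 by the choice of tau), and the surviving small items take adjusted weight tau. A handy sanity check is that the total adjusted weight in the reservoir is preserved exactly.

```python
import random

def varopt_step(reservoir, item, weight, k):
    """One arrival in a simplified VarOpt_k-style reservoir.
    reservoir: list of [item, adjusted_weight], at most k entries."""
    if len(reservoir) < k:                 # reservoir not yet full: keep exactly
        reservoir.append([item, float(weight)])
        return reservoir
    cand = reservoir + [[item, float(weight)]]
    ws = sorted(w for _, w in cand)        # k+1 weights, ascending

    # Find tau with sum(min(1, w/tau)) == k: if the j smallest weights lie
    # below tau, then tau = (sum of those j weights) / (j - 1).
    S, tau = 0.0, None
    for j in range(1, k + 2):
        S += ws[j - 1]
        if j >= 2:
            t = S / (j - 1)
            if t >= ws[j - 1] and (j == k + 1 or t <= ws[j]):
                tau = t
                break
    if tau is None:                        # numeric-tie fallback: all "small"
        tau = S / k

    # Drop exactly one small item (w < tau), item i with prob 1 - w_i/tau.
    r, acc, drop = random.random(), 0.0, None
    for idx, (_, w) in enumerate(cand):
        if w < tau:
            acc += 1.0 - w / tau
            drop = idx                     # last small item is the fallback
            if r < acc:
                break

    # Survivors: small items get adjusted weight tau, large keep their own.
    return [[it, tau if w < tau else w]
            for idx, (it, w) in enumerate(cand) if idx != drop]
```

Unbiasedness shows up in the invariant that the adjusted weights always sum to the total weight seen so far, so the subset-sum estimate (sum of adjusted weights of sampled items in the subset) has the right expectation.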