From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size $k$ that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, $\textsc{VarOpt}_k$, that dominates all previous schemes in terms of estimation quality. $\textsc{VarOpt}_k$ provides variance-optimal unbiased estimation of subset sums. More precisely, if we have seen $n$ items of the stream, then for any subset size $m$, our scheme based on $k$ samples minimizes the average variance over all subsets of size $m$. In fact, the optimality holds against any off-line scheme with $k$ samples tailored to the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worst-case bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in $O(\log k)$ time. Finally, it is particularly well suited for combining samples from different streams in a distributed setting.
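The update rule behind such a variance-optimal reservoir can be sketched as follows; this is a simplified Python version (function and variable names are mine, and the scan below is O(k) per item, whereas the paper's data structures achieve the stated O(log k)). When a new item brings the pool to k+1 candidates, a threshold tau is chosen so that the inclusion probabilities min(1, w/tau) sum to k; exactly one item with w < tau is dropped, item i with probability 1 - w_i/tau (these probabilities sum to 1 by the choice of tau), and the surviving small items take adjusted weight tau. A handy sanity check is that the total adjusted weight in the reservoir is preserved exactly.

```python
import random

def varopt_step(reservoir, item, weight, k):
    """One arrival in a simplified VarOpt_k-style reservoir.
    reservoir: list of [item, adjusted_weight], at most k entries."""
    if len(reservoir) < k:                 # reservoir not yet full: keep exactly
        reservoir.append([item, float(weight)])
        return reservoir
    cand = reservoir + [[item, float(weight)]]
    ws = sorted(w for _, w in cand)        # k+1 weights, ascending

    # Find tau with sum(min(1, w/tau)) == k: if the j smallest weights lie
    # below tau, then tau = (sum of those j weights) / (j - 1).
    S, tau = 0.0, None
    for j in range(1, k + 2):
        S += ws[j - 1]
        if j >= 2:
            t = S / (j - 1)
            if t >= ws[j - 1] and (j == k + 1 or t <= ws[j]):
                tau = t
                break
    if tau is None:                        # numeric-tie fallback: all "small"
        tau = S / k

    # Drop exactly one small item (w < tau), item i with prob 1 - w_i/tau.
    r, acc, drop = random.random(), 0.0, None
    for idx, (_, w) in enumerate(cand):
        if w < tau:
            acc += 1.0 - w / tau
            drop = idx                     # last small item is the fallback
            if r < acc:
                break

    # Survivors: small items get adjusted weight tau, large keep their own.
    return [[it, tau if w < tau else w]
            for idx, (it, w) in enumerate(cand) if idx != drop]
```

Unbiasedness shows up in the invariant that the adjusted weights always sum to the total weight seen so far, so the subset-sum estimate (sum of adjusted weights of sampled items in the subset) has the right expectation.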