Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS).
SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data.
Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences.
The art of computer programming, volume 2 (3rd ed.): seminumerical algorithms.
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data.
Introduction to Algorithms.
On the relationship between file sizes, transport protocols, and self-similar network traffic. ICNP '96 Proceedings of the 1996 International Conference on Network Protocols (ICNP '96).
IEEE Security and Privacy.
Flow sampling under hard resource constraints. Proceedings of the joint international conference on Measurement and modeling of computer systems.
Estimating arbitrary subset sums with few probes. Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems.
Sampling algorithms in a stream operator. Proceedings of the 2005 ACM SIGMOD international conference on Management of data.
The DLT priority sampling is essentially optimal. Proceedings of the thirty-eighth annual ACM symposium on Theory of computing.
Confidence intervals for priority sampling. SIGMETRICS '06/Performance '06 Proceedings of the joint international conference on Measurement and modeling of computer systems.
Data streams: algorithms and applications. Foundations and Trends in Theoretical Computer Science.
Optimal combination of sampled network measurements. IMC '05 Proceedings of the 5th ACM SIGCOMM conference on Internet Measurement.
Bottom-k sketches: better and more efficient estimation of aggregates. Proceedings of the 2007 ACM SIGMETRICS international conference on Measurement and modeling of computer systems.
Sketching unaggregated data streams for subpopulation-size queries. Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems.
Equivalence between priority queues and sorting. Journal of the ACM (JACM).
On the variance of subset sum estimation. ESA'07 Proceedings of the 15th annual European conference on Algorithms.
Learn more, sample less: control of volume and variance in network measurement. IEEE Transactions on Information Theory.
Tighter estimation using bottom k sketches. Proceedings of the VLDB Endowment.
Stream sampling for variance-optimal estimation of subset sums. SODA '09 Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms.
Leveraging discarded samples for tighter estimation of multiple-set aggregates. Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems.
Distinct-value synopses for multiset operations. Communications of the ACM.
Composable, scalable, and accurate weight summarization of unaggregated data sets. Proceedings of the VLDB Endowment.
Coordinated weighted sampling for estimating aggregates over multiple weight assignments. Proceedings of the VLDB Endowment.
Get the most out of your sample: optimal unbiased estimators using partial information. Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems.
Tight bounds for Lp samplers, finding duplicates in streams, and related problems. Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems.
Detecting adversarial advertisements in the wild. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining.
Efficient Stream Sampling for Variance-Optimal Estimation of Subset Sums. SIAM Journal on Computing.
Fair sampling across network flow measurements. Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE joint international conference on Measurement and Modeling of Computer Systems.
Differentially private summaries for sparse data. Proceedings of the 15th International Conference on Database Theory.
Statistical distortion: consequences of data cleaning. Proceedings of the VLDB Endowment.
Content placement via the exponential potential function method. IPCO'13 Proceedings of the 16th international conference on Integer Programming and Combinatorial Optimization.
Bottom-k and priority sampling, set similarity and subset sums with minimal independence. Proceedings of the forty-fifth annual ACM symposium on Theory of computing.
From a high-volume stream of weighted items, we want to create a generic sample of a certain limited size that we can later use to estimate the total weight of arbitrary subsets. Applied to Internet traffic analysis, the items could be records summarizing the flows of packets streaming by a router. Subsets could be flow records from different time intervals of a worm attack whose signature is only determined later. The samples taken in the past thus allow us to trace the history of the attack even though the worm was unknown at the time of sampling. Estimation from the samples must be accurate even with heavy-tailed distributions where most of the weight is concentrated on a few heavy items. We want the sample to be weight sensitive, giving priority to heavy items. At the same time, we want sampling without replacement in order to avoid selecting heavy items multiple times. To fulfill these requirements we introduce priority sampling, the first weight-sensitive sampling scheme without replacement that works in a streaming context and is suitable for estimating subset sums. Testing priority sampling on Internet traffic analysis, we found it to perform an order of magnitude better than previous schemes.

Priority sampling is simple to define and implement. We consider a stream of items i = 0, …, n − 1 with weights w_i. For each item i, we generate a random number α_i ∈ (0,1] and create a priority q_i = w_i/α_i. The sample S consists of the k highest-priority items. Let τ be the (k + 1)th highest priority. Each sampled item i in S gets a weight estimate ŵ_i = max{w_i, τ}, while nonsampled items get weight estimate ŵ_i = 0. Magically, it turns out that the weight estimates are unbiased, that is, E[ŵ_i] = w_i, and by linearity of expectation, we get unbiased estimators over any subset sum simply by adding the sampled weight estimates from the subset.
Also, we can estimate the variance of the estimates, and we find, surprisingly, that the covariance between estimates ŵ_i and ŵ_j of different weights is zero. Finally, we conjecture an extremely strong near-optimality: for any weight sequence, there exists no specialized scheme for sampling k items with unbiased weight estimators that gets a smaller variance sum than priority sampling with k + 1 items. Szegedy settled this conjecture at STOC'06.
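As an illustration, the sampling rule described in the abstract can be sketched in a few lines of Python. This is only a sketch, not the authors' reference implementation: the function name priority_sample and the use of a bounded min-heap to track the k + 1 highest priorities in one pass are our own choices.

```python
import heapq
import random

def priority_sample(items, k, rng=random):
    """One-pass priority sampling of a weighted stream.

    items: iterable of (key, weight) pairs with weight > 0.
    Returns a dict mapping each of the k sampled keys to its unbiased
    weight estimate max(w_i, tau), where tau is the (k+1)-th highest
    priority q_i = w_i / alpha_i seen in the stream.
    """
    heap = []  # min-heap holding the k+1 highest (priority, key, weight)
    for key, w in items:
        alpha = 1.0 - rng.random()      # uniform on (0, 1]
        q = w / alpha                   # priority q_i = w_i / alpha_i
        heapq.heappush(heap, (q, key, w))
        if len(heap) > k + 1:
            heapq.heappop(heap)         # discard the lowest priority
    if len(heap) <= k:
        # Fewer than k+1 items in the stream: keep all, with exact weights.
        return {key: w for _, key, w in heap}
    tau = heap[0][0]                    # (k+1)-th highest priority
    return {key: max(w, tau) for _, key, w in heap[1:]}
```

An estimate of any subset sum is then obtained by summing the estimates of the sampled keys that fall in the subset; averaged over repeated runs, it converges to the true subset weight, reflecting the unbiasedness E[ŵ_i] = w_i.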