Priority sampling for estimation of arbitrary subset sums

  • Authors:
  • Nick Duffield, Carsten Lund, Mikkel Thorup

  • Affiliations:
  • AT&T Labs--Research, Florham Park, New Jersey (all authors)

  • Venue:
  • Journal of the ACM (JACM)
  • Year:
  • 2007

Abstract

From a high-volume stream of weighted items, we want to create a generic sample of a certain limited size that we can later use to estimate the total weight of arbitrary subsets. Applied to Internet traffic analysis, the items could be records summarizing the flows of packets streaming by a router. Subsets could be flow records from different time intervals of a worm attack whose signature is only determined later. The samples taken in the past thus allow us to trace the history of the attack even though the worm was unknown at the time of sampling. Estimation from the samples must be accurate even with heavy-tailed distributions where most of the weight is concentrated on a few heavy items. We want the sample to be weight sensitive, giving priority to heavy items. At the same time, we want sampling without replacement in order to avoid selecting heavy items multiple times. To fulfill these requirements we introduce priority sampling, which is the first weight-sensitive sampling scheme without replacement that works in a streaming context and is suitable for estimating subset sums. Testing priority sampling on Internet traffic analysis, we found it to perform an order of magnitude better than previous schemes.

Priority sampling is simple to define and implement: we consider a stream of items i = 0, …, n − 1 with weights w_i. For each item i, we generate a random number α_i ∈ (0,1] and create a priority q_i = w_i/α_i. The sample S consists of the k highest-priority items. Let τ be the (k + 1)th highest priority. Each sampled item i in S gets the weight estimate ŵ_i = max{w_i, τ}, while nonsampled items get weight estimate ŵ_i = 0. Magically, it turns out that the weight estimates are unbiased, that is, E[ŵ_i] = w_i, and by linearity of expectation, we get an unbiased estimator of any subset sum simply by adding the weight estimates of the sampled items from the subset. Also, we can estimate the variance of the estimates, and find, surprisingly, that the covariance between estimates ŵ_i and ŵ_j of different weights is zero. Finally, we conjecture an extremely strong near-optimality: for any weight sequence, there exists no specialized scheme for sampling k items with unbiased weight estimators that gets a smaller variance sum than priority sampling with k + 1 items. Szegedy settled this conjecture at STOC'06.
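To make the scheme concrete, here is a minimal streaming sketch in Python. It is not from the paper; the helper name priority_sample, its signature, and the Pareto-distributed test weights are illustrative assumptions, and it assumes the stream has more than k + 1 items.

```python
import heapq
import random

def priority_sample(weights, k):
    """Sketch of priority sampling over a stream of weights.

    Returns (sample, tau): sample maps item index i to its weight
    estimate max(w_i, tau), and tau is the (k+1)th highest priority.
    Assumes len(weights) > k + 1. Helper name and signature are
    illustrative, not from the paper.
    """
    heap = []  # min-heap keeping the k+1 highest priorities seen so far
    for i, w in enumerate(weights):
        alpha = 1.0 - random.random()            # uniform on (0, 1]
        heapq.heappush(heap, (w / alpha, i, w))  # priority q_i = w_i / alpha_i
        if len(heap) > k + 1:
            heapq.heappop(heap)                  # drop the lowest priority
    tau, _, _ = heapq.heappop(heap)              # (k+1)th highest priority
    # The k highest-priority items form the sample S; each gets max(w_i, tau).
    return {i: max(w, tau) for _, i, w in heap}, tau

# Usage: estimate an arbitrary subset sum from one generic sample
# of heavy-tailed weights, with the subset chosen after sampling.
random.seed(0)
weights = [random.paretovariate(1.5) for _ in range(10_000)]
sample, tau = priority_sample(weights, k=200)
subset = set(range(0, 10_000, 7))
estimate = sum(est for i, est in sample.items() if i in subset)
truth = sum(weights[i] for i in subset)
print(f"estimate = {estimate:.1f}, true subset sum = {truth:.1f}")
```

Since nonsampled items implicitly get estimate ŵ_i = 0, the subset-sum estimator only sums over sampled members of the subset; by the unbiasedness E[ŵ_i] = w_i stated above, this sum is an unbiased estimate of the true subset weight.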