Priority sampling for estimation of arbitrary subset sums

  • Authors:
  • Nick Duffield, Carsten Lund, Mikkel Thorup

  • Affiliations:
  • AT&T Labs--Research, Florham Park, New Jersey (all authors)

  • Venue:
  • Journal of the ACM (JACM)
  • Year:
  • 2007

Abstract

From a high-volume stream of weighted items, we want to create a generic sample of a certain limited size that we can later use to estimate the total weight of arbitrary subsets. Applied to Internet traffic analysis, the items could be records summarizing the flows of packets streaming by a router. Subsets could be flow records from different time intervals of a worm attack whose signature is only determined later. The samples taken in the past thus allow us to trace the history of the attack even though the worm was unknown at the time of sampling. Estimation from the samples must be accurate even with heavy-tailed distributions where most of the weight is concentrated on a few heavy items. We want the sample to be weight sensitive, giving priority to heavy items. At the same time, we want sampling without replacement in order to avoid selecting heavy items multiple times. To fulfill these requirements we introduce priority sampling, which is the first weight-sensitive sampling scheme without replacement that works in a streaming context and is suitable for estimating subset sums. Testing priority sampling on Internet traffic analysis, we found it to perform an order of magnitude better than previous schemes.

Priority sampling is simple to define and implement: we consider a stream of items i = 0, …, n − 1 with weights w_i. For each item i, we generate a random number α_i ∈ (0,1] and create a priority q_i = w_i/α_i. The sample S consists of the k highest-priority items. Let τ be the (k + 1)th highest priority. Each sampled item i in S gets the weight estimate ŵ_i = max{w_i, τ}, while nonsampled items get weight estimate ŵ_i = 0. Magically, it turns out that the weight estimates are unbiased, that is, E[ŵ_i] = w_i, and by linearity of expectation, we get an unbiased estimator of any subset sum simply by adding the weight estimates of the sampled items from the subset. Also, we can estimate the variance of the estimates, and find, surprisingly, that the covariance between estimates ŵ_i and ŵ_j of different weights is zero. Finally, we conjecture an extremely strong near-optimality: for any weight sequence, there exists no specialized scheme for sampling k items with unbiased weight estimators that gets a smaller variance sum than priority sampling with k + 1 items. Szegedy settled this conjecture at STOC'06.
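To make the scheme concrete, here is a minimal streaming sketch in Python. It is not from the paper; the helper name priority_sample, its signature, and the Pareto-distributed test weights are illustrative assumptions, and it assumes the stream has more than k + 1 items.

```python
import heapq
import random

def priority_sample(weights, k):
    """Sketch of priority sampling over a stream of weights.

    Returns (sample, tau): sample maps item index i to its weight
    estimate max(w_i, tau), and tau is the (k+1)th highest priority.
    Assumes len(weights) > k + 1. Helper name and signature are
    illustrative, not from the paper.
    """
    heap = []  # min-heap keeping the k+1 highest priorities seen so far
    for i, w in enumerate(weights):
        alpha = 1.0 - random.random()            # uniform on (0, 1]
        heapq.heappush(heap, (w / alpha, i, w))  # priority q_i = w_i / alpha_i
        if len(heap) > k + 1:
            heapq.heappop(heap)                  # drop the lowest priority
    tau, _, _ = heapq.heappop(heap)              # (k+1)th highest priority
    # The k highest-priority items form the sample S; each gets max(w_i, tau).
    return {i: max(w, tau) for _, i, w in heap}, tau

# Usage: estimate an arbitrary subset sum from one generic sample
# of heavy-tailed weights, with the subset chosen after sampling.
random.seed(0)
weights = [random.paretovariate(1.5) for _ in range(10_000)]
sample, tau = priority_sample(weights, k=200)
subset = set(range(0, 10_000, 7))
estimate = sum(est for i, est in sample.items() if i in subset)
truth = sum(weights[i] for i in subset)
print(f"estimate = {estimate:.1f}, true subset sum = {truth:.1f}")
```

Since nonsampled items implicitly get estimate ŵ_i = 0, the subset-sum estimator only sums over sampled members of the subset; by the unbiasedness E[ŵ_i] = w_i stated above, this sum is an unbiased estimate of the true subset weight.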