Sequential reservoir sampling with a nonuniform distribution

  • Authors: M. Kolonko; D. Wäsch
  • Affiliations: Technical University of Clausthal, Clausthal-Zellerfeld, Germany (both authors)
  • Venue: ACM Transactions on Mathematical Software (TOMS)
  • Year: 2006

Abstract

We present a simple algorithm that allows sampling from a stream of data items without knowing the number of items in advance and without having to store all items in main memory. The sampling distribution may be general, that is, the probability of selecting a data item i may depend on the individual item. The main advantage of the algorithm and its variants is that they need to pass through the data items only once to produce a sample of arbitrary size n. We give different variants of the algorithm for sampling with and without replacement and analyze their complexity. We generalize earlier results of Knuth on reservoir sampling with a uniform sampling distribution. The general distribution considered here allows us to sample an item with a probability equal to the relative weight (or fitness) of the data item within the whole set of items. Applications include heuristic optimization procedures such as genetic algorithms, where solutions are sampled from a population with probability proportional to their fitness.
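To make the idea concrete, here is a minimal Python sketch of one-pass weighted sampling with replacement. It is not taken from the paper and may differ from the authors' variants; the function name, the `(item, weight)` stream format, and the `rng` parameter are assumptions made for illustration. Each of the n sample slots is overwritten by the incoming item with probability equal to the item's weight divided by the running total of weights seen so far. By a telescoping product, each slot ends up holding item i with probability equal to its relative weight within the whole stream.

```python
import random
from typing import Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def weighted_reservoir_with_replacement(
    stream: Iterable[Tuple[T, float]],
    n: int,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """One-pass weighted sampling with replacement (illustrative sketch).

    `stream` yields (item, weight) pairs; this format is an assumption.
    Items with non-positive weight are skipped. Returns n items, or an
    empty list if no item with positive weight was seen.
    """
    rng = rng or random.Random()
    sample: List[T] = []
    total = 0.0  # running sum of the weights seen so far

    for item, weight in stream:
        if weight <= 0:
            continue
        total += weight
        if not sample:
            # The first usable item fills every slot (probability w/w = 1).
            sample = [item] * n
            continue
        # Each slot is replaced independently with probability w / total.
        p = weight / total
        for k in range(n):
            if rng.random() < p:
                sample[k] = item
    return sample


if __name__ == "__main__":
    # Hypothetical population of solutions with fitness values as weights.
    population = [("s1", 1.0), ("s2", 3.0), ("s3", 6.0)]
    print(weighted_reservoir_with_replacement(population, n=5))
```

This naive version draws n random numbers per item and stores only the n sampled items plus the running weight total. It does not reproduce the complexity analysis or the without-replacement variants mentioned in the abstract.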