Random sampling with a reservoir

  • Authors:
  • Jeffrey S. Vitter

  • Affiliations:
  • Brown Univ., Providence, RI

  • Venue:
  • ACM Transactions on Mathematical Software (TOMS)
  • Year:
  • 1985

Quantified Score

Hi-index 0.00

Visualization

Abstract

We introduce fast algorithms for selecting a random sample of n records without replacement from a pool of N records, where the value of N is unknown beforehand. The main result of the paper is the design and analysis of Algorithm Z; it does the sampling in one pass using constant space and in O(n(1 + log(N/n))) expected time, which is optimum, up to a constant factor. Several optimizations are studied that collectively improve the speed of the naive version of the algorithm by an order of magnitude. We give an efficient Pascal-like implementation that incorporates these modifications and that is suitable for general use. Theoretical and empirical results indicate that Algorithm Z outperforms current methods by a significant margin.