Random sampling with a reservoir
ACM Transactions on Mathematical Software (TOMS)
Reservoir-sampling algorithms of time complexity O(n(1 + log(N/n)))
ACM Transactions on Mathematical Software (TOMS)
The art of computer programming, volume 2 (3rd ed.): seminumerical algorithms
The art of computer programming, volume 2 (3rd ed.): seminumerical algorithms
Sampling streaming data with replacement
Computational Statistics & Data Analysis
Optimal sampling from sliding windows
Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Optimal sampling from sliding windows
Journal of Computer and System Sciences
We present a simple algorithm for sampling from a stream of data items without knowing the number of items in advance and without storing all items in main memory. The sampling distribution may be general, that is, the probability of selecting a data item i may depend on the individual item. The main advantage of the algorithm is that it passes through the data items only once to produce a sample of arbitrary size n. We give variants of the algorithm for sampling with and without replacement and analyze their complexity. These variants generalize earlier results of Knuth on reservoir sampling with a uniform sampling distribution. The general distribution considered here allows us to sample an item with probability equal to the relative weight (or fitness) of that item within the whole set of items. Applications include heuristic optimization procedures such as genetic algorithms, where solutions are sampled from a population with probability proportional to their fitness.
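To make the one-pass idea concrete, the following is a minimal Python sketch of weighted sampling with replacement from a stream. It is not necessarily the algorithm of the paper: it uses the standard construction of n independent single-item reservoirs, each of which is overwritten by the incoming item with probability equal to that item's weight divided by the running total of weights. The function name `weighted_stream_sample`, the `(item, weight)` stream format, and the `rng` parameter are assumptions made for illustration.

```python
import random
from typing import Any, Iterable, List, Optional, Tuple


def weighted_stream_sample(
    stream: Iterable[Tuple[Any, float]],
    n: int,
    rng: Optional[random.Random] = None,
) -> List[Any]:
    """Draw n items with replacement in a single pass over `stream`.

    Each element of `stream` is an (item, weight) pair (an assumed format).
    At the end, every slot holds item i with probability w_i / sum(w),
    independently of the other slots.
    """
    rng = rng or random.Random()
    reservoir: List[Any] = [None] * n
    total = 0.0  # running sum of the weights seen so far

    for item, weight in stream:
        if weight <= 0:
            continue  # items with non-positive weight can never be selected
        total += weight
        p = weight / total  # equals 1.0 for the first positive-weight item
        for k in range(n):
            # Each slot is an independent size-1 weighted reservoir.
            if rng.random() < p:
                reservoir[k] = item

    if total == 0.0:
        raise ValueError("stream contained no item with positive weight")
    return reservoir
```

A slot ends up holding item i exactly when item i is accepted and no later item overwrites it; the product of these probabilities telescopes to w_i divided by the total weight. This sketch spends O(n) work per stream item and stores only n items plus the running total, so it illustrates the memory behaviour described above but makes no attempt to match the complexity bounds analyzed in the paper.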