Faster methods for random sampling

Authors:
Jeffrey Scott Vitter
Affiliations:
Brown Univ., Providence, RI
Venue:
Communications of the ACM
Year:
1984

Citing 4
Cited 25

Algorithms

The art of computer programming, volume 2 (3rd ed.): seminumerical algorithms

Generating Sorted Lists of Random Numbers

ACM Transactions on Mathematical Software (TOMS)
A note on sampling a tape-file

Communications of the ACM

Random sampling with a reservoir

ACM Transactions on Mathematical Software (TOMS)
An efficient algorithm for sequential random sampling

ACM Transactions on Mathematical Software (TOMS)
Random sampling from B+ trees

VLDB '89 Proceedings of the 15th international conference on Very large data bases
Optimal sample cost residues for differential database batch query problems

Journal of the ACM (JACM)
Las Vegas algorithms for linear and integer programming when the dimension is small

Journal of the ACM (JACM)
Sequential random sampling

ACM Transactions on Mathematical Software (TOMS)
A model for the prediction of R-tree performance

PODS '96 Proceedings of the fifteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Output-sensitive generation of random events

Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms
An Improved Algorithm for Ordered Sequential Random Sampling

ACM Transactions on Mathematical Software (TOMS)
Efficient Cost Models for Spatial Queries Using R-Trees

IEEE Transactions on Knowledge and Data Engineering
Sampling Strategies for Targeting Rare Groups from a Bank Customer Database

PKDD '00 Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery
Simple Random Sampling from Relational Databases

VLDB '86 Proceedings of the 12th International Conference on Very Large Data Bases
Dynamic maintenance of web indexes using landmarks

WWW '03 Proceedings of the 12th international conference on World Wide Web
Range counting over multidimensional data streams

SCG '04 Proceedings of the twentieth annual symposium on Computational geometry
Random sampling from database files: a survey

SSDBM'1990 Proceedings of the 5th international conference on Statistical and Scientific Database Management
Precision-time tradeoffs: a paradigm for processing statistical queries on databases

SSDBM'1988 Proceedings of the 4th international conference on Statistical and Scientific Database Management
Weighted random sampling with a reservoir

Information Processing Letters
A dip in the reservoir: maintaining sample synopses of evolving datasets

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Efficient Update of Indexes for Dynamically Changing Web Documents

World Wide Web
Sampling time-based sliding windows in bounded space

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Mind the gaps: weighting the unknown in large-scale one-class collaborative filtering

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
DOULION: counting triangles in massive graphs with a coin

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Deferred maintenance of disk-based random samples

EDBT'06 Proceedings of the 10th international conference on Advances in Database Technology
FastSIR algorithm: A fast algorithm for the simulation of the epidemic spread in large networks by using the susceptible-infected-recovered compartment model

Information Sciences: an International Journal


Abstract

Several new methods are presented for selecting n records at random without replacement from a file containing N records. Each algorithm selects the records for the sample sequentially, that is, in the same order in which the records appear in the file. The algorithms are online in that the records for the sample are selected iteratively with no preprocessing. The algorithms require a constant amount of space and are short and easy to implement. The main result of this paper is the design and analysis of Algorithm D, which does the sampling in O(n) time on average; roughly n uniform random variates are generated, and approximately n exponentiation operations (of the form a^b, for real numbers a and b) are performed during the sampling. This solves an open problem in the literature. CPU timings on a large mainframe computer indicate that Algorithm D is significantly faster than the sampling algorithms in use today.
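
To make the sequential setting in the abstract concrete, here is a minimal sketch of the classic selection-sampling baseline. It scans the file once and keeps each record with probability (records still needed) / (records still unread). This is not Algorithm D: it draws one uniform variate per record scanned, so it runs in O(N) rather than O(n) time, and the skip-distance generation that Algorithm D relies on is not described on this page. The function name `selection_sample` and its parameters are assumptions made for illustration.

```python
import random


def selection_sample(records, n, total, rng=None):
    """Yield n of the `total` records in `records`, in file order.

    Every size-n subset is equally likely, provided `records` yields
    exactly `total` items. Illustrative baseline only, not Algorithm D.
    """
    if not 0 <= n <= total:
        raise ValueError("need 0 <= n <= total")
    rng = rng or random.Random()
    needed = n          # records still to be selected
    remaining = total   # records not yet examined
    for record in records:
        if needed == 0:
            return
        # Keep this record with probability needed / remaining.
        # rng.random() lies in [0, 1), so the record is always kept
        # once needed == remaining.
        if rng.random() * remaining < needed:
            yield record
            needed -= 1
        remaining -= 1


if __name__ == "__main__":
    # Draw 5 record positions out of a 1000-record "file".
    print(list(selection_sample(range(1000), 5, 1000)))
```

The sample comes out in file order and the only state kept is two counters, which matches the online, constant-space properties the abstract describes. The difference is that this baseline examines every record until the sample is complete, whereas the abstract reports that Algorithm D needs only about n uniform variates in total.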