Several new methods are presented for selecting n records at random without replacement from a file containing N records. Each algorithm selects the records for the sample in a sequential manner, in the same order the records appear in the file. The algorithms are online in that the records for the sample are selected iteratively with no preprocessing. The algorithms require a constant amount of space and are short and easy to implement. The main result of this paper is the design and analysis of Algorithm D, which does the sampling in O(n) time, on the average; roughly n uniform random variates are generated, and approximately n exponentiation operations (of the form a^b, for real numbers a and b) are performed during the sampling. This solves an open problem in the literature. CPU timings on a large mainframe computer indicate that Algorithm D is significantly faster than the sampling algorithms in use today.
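
To make the problem setting concrete, here is a minimal Python sketch of sequential sampling without replacement. It is not Algorithm D. It is the classical selection-sampling baseline, which examines every record and draws one uniform variate per record, so it takes O(N) time rather than the O(n) average time the abstract reports for Algorithm D. The function name `sequential_sample` and its parameters are illustrative assumptions, not taken from the paper.

```python
import random


def sequential_sample(records, n, N, rng=random):
    """Yield n of the N records in file order (hypothetical helper).

    Classical selection sampling, not Algorithm D: each record is kept
    with probability (records still needed) / (records still unread).
    Uses constant extra space and needs no preprocessing.
    """
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= N")
    needed = n       # records still to be selected
    remaining = N    # records not yet examined
    for record in records:
        if needed == 0:
            break    # sample complete; the rest of the file is skipped
        # rng.random() is in [0, 1), so a record is always kept once
        # needed == remaining, which guarantees exactly n selections.
        if rng.random() * remaining < needed:
            yield record
            needed -= 1
        remaining -= 1


if __name__ == "__main__":
    # Draw 5 of 20 record indices; the output is already sorted.
    print(list(sequential_sample(range(20), 5, 20)))
```

The selected records come out in the same order they appear in the input, which is the property the abstract calls sequential. Algorithm D keeps that property but, per the abstract, avoids touching every record by generating only about n random variates.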