Random sampling is one of the most fundamental data management tools available. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a "sample" is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples, gigabytes or terabytes in size, can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples from a data management perspective, and no techniques now exist for maintaining very large samples in an online manner from streaming data. In this paper, we present online algorithms for maintaining on-disk samples that are gigabytes or terabytes in size. The algorithms are designed for streaming data, or for any environment where a large sample must be maintained online in a single pass through a data set. The algorithms meet the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. We also present algorithms for retrieving a small random sample from a large disk-based sample, which may be used for various purposes, including statistical analyses by a DBMS.
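To make the single-pass, without-replacement requirement concrete, the sketch below shows the classic in-memory reservoir technique. It is a baseline for illustration only, not the disk-based algorithms this paper presents. The function name `reservoir_sample` and its parameters are hypothetical, and the sketch assumes the whole sample fits in memory.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items, without replacement.

    Classic in-memory reservoir sampling over a single pass of `stream`.
    Assumption: the reservoir fits in memory, which is exactly what does
    not hold for the gigabyte- or terabyte-sized samples discussed above.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i (0-based) is kept with probability k / (i + 1):
            # draw a slot in [0, i] and overwrite only if it falls in [0, k).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), 5, random.Random(42))
    print(sample)
```

After every item, the reservoir is a uniform random sample of everything seen so far, which matches the invariant stated in the abstract. Each accepted item overwrites a randomly chosen slot; if the reservoir lived on disk, each such overwrite would likely cost a random I/O, which may be one reason this baseline does not carry over directly to very large samples.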