VLDB '89 Proceedings of the 15th international conference on Very large data bases
Randomized algorithms
The log-structured merge-tree (LSM-tree)
Acta Informatica
BIRCH: an efficient data clustering method for very large databases
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
Ripple joins for online aggregation
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Efficient progressive sampling
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Database System Implementation
Database System Implementation
A scalable hash ripple join algorithm
Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique
ICDE '96 Proceedings of the Twelfth International Conference on Data Engineering
Concurrency Control Theory for Deferred Materialized Views
ICDT '97 Proceedings of the 6th International Conference on Database Theory
A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Incremental Organization for Data Recording and Warehousing
VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
A Novel Index Supporting High Volume Data Warehouse Insertion
VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Random Sampling from Pseudo-Ranked B+ Trees
VLDB '92 Proceedings of the 18th International Conference on Very Large Data Bases
The Buffer Tree: A New Technique for Optimal I/O-Algorithms (Extended Abstract)
WADS '95 Proceedings of the 4th International Workshop on Algorithms and Data Structures
Random Sampling from Database Files: A Survey
Proceedings of the 5th International Conference SSDBM on Statistical and Scientific Database Management
The learning-curve sampling method applied to model-based clustering
The Journal of Machine Learning Research
Finding the most interesting patterns in a database quickly by using sequential sampling
The Journal of Machine Learning Research
Active Sampling for Class Probability Estimation and Ranking
Machine Learning
A bi-level Bernoulli scheme for database sampling
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Effective use of block-level sampling in statistics estimation
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
The partitioned exponential file for database storage management
The VLDB Journal — The International Journal on Very Large Data Bases
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
NSDI'10 Proceedings of the 7th USENIX conference on Networked systems design and implementation
Hi-index | 0.02 |
Many applications require a randomized ordering of input data. Examples include algorithms for online aggregation, data mining, and various randomized algorithms. Most existing work seems to assume that accessing the records from a large database in a randomized order is not a difficult problem. However, it turns out to be extremely difficult in practice. Using existing methods, randomization is either extremely expensive at the front end (as data are loaded), or at the back end (as data are queried). This paper presents a simple file structure which supports both efficient, online random shuffling of a large database, as well as efficient online sampling or randomization of the database when it is queried. The key innovation of our method is the introduction of a small degree of carefully controlled, rigorously monitored nonrandomness into the file.