Online Random Shuffling of Large Database Tables

Authors:
Christopher Jermaine
Affiliations:
-
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2007

Abstract

Many applications require a randomized ordering of input data. Examples include algorithms for online aggregation, data mining, and various randomized algorithms. Most existing work seems to assume that accessing the records of a large database in randomized order is not a difficult problem. In practice, however, it turns out to be extremely difficult. With existing methods, randomization is extremely expensive either at the front end (as data are loaded) or at the back end (as data are queried). This paper presents a simple file structure that supports both efficient online random shuffling of a large database and efficient online sampling or randomization of the database when it is queried. The key innovation of our method is the introduction of a small degree of carefully controlled, rigorously monitored nonrandomness into the file.
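
The front-end versus back-end trade-off described in the abstract can be illustrated with a minimal in-memory sketch. This is not the file structure proposed in the paper; it only shows the two baseline strategies, and the function names shuffle_at_load and sample_at_query are hypothetical. On disk, the first strategy requires rewriting the whole table whenever data arrive, and the second requires one random access per sampled record, which is where the cost arises.

```python
import random


def shuffle_at_load(records, seed=None):
    """Front-end randomization (hypothetical helper): permute the data once
    with a Fisher-Yates shuffle, so that any prefix of the result is a
    uniform random sample drawn without replacement."""
    rng = random.Random(seed)
    out = list(records)
    for i in range(len(out) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        out[i], out[j] = out[j], out[i]
    return out


def sample_at_query(records, k, seed=None):
    """Back-end randomization (hypothetical helper): keep the data in load
    order and pay for k random accesses when the query runs."""
    rng = random.Random(seed)
    positions = rng.sample(range(len(records)), k)
    return [records[p] for p in positions]


table = list(range(10))

# Strategy 1: pay at load time, then read sequentially.
shuffled = shuffle_at_load(table, seed=0)
prefix_sample = shuffled[:3]

# Strategy 2: load cheaply, then pay at query time.
query_sample = sample_at_query(table, 3, seed=0)
```

In both cases the sample is uniform; the difference lies in when the random I/O is paid for, which is the tension the abstract says existing methods do not resolve.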