Online Random Shuffling of Large Database Tables

  • Authors: Christopher Jermaine
  • Venue: IEEE Transactions on Knowledge and Data Engineering
  • Year: 2007

Abstract

Many applications require a randomized ordering of input data. Examples include algorithms for online aggregation, data mining, and various randomized algorithms. Most existing work seems to assume that accessing the records of a large database in a randomized order is not a difficult problem; in practice, however, it turns out to be extremely difficult. With existing methods, randomization is extremely expensive either at the front end (as data are loaded) or at the back end (as data are queried). This paper presents a simple file structure that supports both efficient online random shuffling of a large database and efficient online sampling or randomization of the database when it is queried. The key innovation of our method is the introduction of a small degree of carefully controlled, rigorously monitored nonrandomness into the file.