Quickly generating billion-record synthetic databases

  • Authors and affiliations:
  • Jim Gray, Digital Equipment Corporation, 455 Market, San Francisco, CA
  • Prakash Sundaresan, Digital Equipment Corporation, 455 Market, San Francisco, CA
  • Susanne Englert, Tandem Computers Inc., 19333 Vallco Parkway, Cupertino, CA
  • Ken Baclawski, Computer Science, Northeastern University, 360 Huntington Ave., Boston, MA
  • Peter J. Weinberger, Bell Laboratories, 600 Mountain Ave., Murray Hill, NJ

  • Venue:
  • SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
  • Year:
  • 1994


Abstract

Evaluating database system performance often requires generating synthetic databases—ones having certain statistical properties but filled with dummy information. When evaluating different database designs, it is often necessary to generate several databases and evaluate each design. As database sizes grow to terabytes, generation often takes longer than evaluation. This paper presents several database generation techniques. In particular it discusses: (1) Parallelism to get generation speedup and scaleup. (2) Congruential generators to get dense unique uniform distributions. (3) Special-case discrete logarithms to generate indices concurrent to the base table generation. (4) Modification of (2) to get exponential, normal, and self-similar distributions. The discussion is in terms of generating billion-record SQL databases using C programs running on a shared-nothing computer system consisting of a hundred processors, with a thousand discs. The ideas apply to smaller databases, but large databases present the more difficult problems.