Fast pseudorandom-number generators with modulus 2^k or 2^k - 1 using fused multiply-add

  • Authors:
  • R. C. Agarwal;R. F. Enenkel;F. G. Gustavson;A. Kothari;M. Zubair

  • Affiliations:
  • IBM Research Division, Almaden Research Center, San Jose, California;IBM Toronto Laboratory, Markham, Ontario, Canada;IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York;-;Old Dominion University, Computer Science Department, Norfolk, Virginia

  • Venue:
  • IBM Journal of Research and Development
  • Year:
  • 2002

Abstract

Many numerically intensive computations done in a scientific computing environment require uniformly distributed pseudorandom numbers in the ranges (0, 1) and (-1, 1). For multiplicative congruential generators with modulus 2^k, k ≤ 52, and period 2^(k-2), we show that the cost per random number for these two distributions is 3 and 3.125 multiply-adds, respectively, on RS/6000® processors. Our code, on the IBM POWER2 Model 590, produces more than 40 million uniformly distributed pseudorandom numbers per second for both ranges (0, 1) and (-1, 1). Additionally, our code sustains the 40 million per second rate for data out of cache. The Numerical Aerodynamic Simulation (NAS) parallel benchmarks use a linear congruential generator with modulus 2^46. Our result is about 50 times faster than the generic implementation given in the benchmarks. The extra-accuracy fused multiply-add instruction of RS/6000 machines combined with a few algorithmic innovations gives rise to the 50-fold increase. If IEEE 64-bit arithmetic is used with our Fortran code on POWER and PowerPC® architectures, the results we obtain are bit-wise identical to those of the generic algorithms. The paper gives several illustrations of a general technique called the Algorithm and Architecture approach. We demonstrate herein that programmer-controlled unrolling of loops is equivalent to customized vectorization of RISC-type code. Customized vectorization is more powerful than ordinary vectorization, and it is only possible on RISC-type machines. We illustrate its use to show that RS/6000 processors can compute the distribution (-1, 1) at the rate of 3.125 multiply-adds per random number. We also specify a linear congruential generator that is related to the multiplicative congruential generator referred to above. It has a full period of 2^k, where 2^k is the modulus. The cost per random number [in the range (0, 1)] for this generator is four multiply-adds on RS/6000 processors.
Our code, on the IBM POWER2 Model 590, for this generator produces more than 30 million uniformly distributed pseudorandom numbers per second for the range (0, 1). We show that this generator is embarrassingly parallel, or EP. Using the Algorithm and Architecture approach, we describe a new concept called "generalized unrolling." Finally, we present a multiplicative congruential generator for which the modulus is not a power of 2. Such a generator, as well as one with modulus 2^k, is selectable as the generator used in the RANDOM_NUMBER intrinsic function of IBM XL Fortran and XL High Performance Fortran. All of the generators reported here are EP. Using an IBM SP2 machine with 250 wide nodes, it is possible to compute more than ten billion uniform random numbers in a second.
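
As a rough illustration of the kind of generator the abstract describes, the following Python sketch implements a multiplicative congruential generator with modulus 2^46 using exact integer arithmetic, maps its states to the ranges (0, 1) and (-1, 1), and shows the skip-ahead property that makes such generators embarrassingly parallel. It is not the authors' fused multiply-add Fortran code. The multiplier 5^13 and the seed are assumptions taken from common NAS benchmark usage, not from the abstract, and all function names are hypothetical.

```python
# Minimal sketch of a multiplicative congruential generator (MCG),
# x_{n+1} = a * x_n mod 2^k, using exact Python integers.
# Assumptions (not stated in the abstract): k = 46 with multiplier
# a = 5**13, as commonly used in the NAS benchmarks; the seed is illustrative.

MODULUS_BITS = 46
MODULUS = 1 << MODULUS_BITS      # 2^46
MULTIPLIER = 5 ** 13             # assumed multiplier
SEED = 271828183                 # illustrative odd seed


def mcg_next(state):
    """Advance the generator by one step."""
    return (MULTIPLIER * state) % MODULUS


def to_unit(state):
    """Map a state to (0, 1). An odd state is never 0, so 0 is excluded."""
    return state / MODULUS


def to_symmetric(state):
    """Map a state to (-1, 1)."""
    return 2.0 * to_unit(state) - 1.0


def skip_ahead(state, n):
    """Jump n steps at once: x_{i+n} = a^n * x_i mod 2^k."""
    return (pow(MULTIPLIER, n, MODULUS) * state) % MODULUS


def stream(seed, count):
    """Return the next `count` states that follow `seed`."""
    states = []
    state = seed
    for _ in range(count):
        state = mcg_next(state)
        states.append(state)
    return states


if __name__ == "__main__":
    sequential = stream(SEED, 10)
    # Two hypothetical workers each produce half of the same sequence.
    first_half = stream(SEED, 5)
    second_half = stream(skip_ahead(SEED, 5), 5)
    print(first_half + second_half == sequential)
    print([to_unit(s) for s in sequential[:3]])
    print([to_symmetric(s) for s in sequential[:3]])
```

Because each worker can compute its own starting state with `skip_ahead`, no communication is needed between workers, which is the sense in which the abstract calls these generators EP. The integer arithmetic here stands in for the floating-point multiply-add formulation of the paper and says nothing about its performance.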