Accurately Selecting Block Size at Runtime in Pipelined Parallel Programs

  • Authors: David K. Lowenthal

  • Affiliations: Department of Computer Science, The University of Georgia, Athens, Georgia 30602-7404. dkl@cs.uga.edu

  • Venue: International Journal of Parallel Programming
  • Year: 2000

Abstract

Loops that contain cross-processor data dependencies, known as DOACROSS loops, are often found in scientific programs. Efficiently parallelizing such loops is important yet nontrivial. One useful parallelization technique for DOACROSS loops is pipelining, in which each processor (node) performs its computation in blocks and, after each block, sends data to the next node in the pipeline. The amount of computation performed before sending a message is called the block size; its choice, although difficult to make statically, is important for efficient execution. This paper describes a flexible runtime approach to choosing the block size. Rather than rely on static estimation of workload, our system takes measurements during the first two iterations of a program and then uses the results to build an execution model and choose an appropriate block size which, unlike a static choice, may be nonuniform. To increase the accuracy of the chosen block size, our execution model takes both intra- and inter-node performance into account. Our system finds an effective block size automatically, without the experimentation that is necessary when using a statically chosen block size. Performance on a network of workstations shows that programs that use our runtime analysis outperform those that use static block sizes by as much as 18% when the workload is unbalanced. When the workload is balanced, competitive performance is achieved as long as the initial overhead is sufficiently amortized.
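
The block-size tradeoff described in the abstract can be illustrated with a simplified cost model. The sketch below is not the execution model from the paper. It is a common textbook approximation that assumes a uniform block size, a constant per-column compute cost on every node, and a fixed per-message cost. The function and parameter names (pipeline_time, best_block_size, t_comp, t_msg) are hypothetical and chosen only for illustration.

```python
import math


def pipeline_time(block_size, n_cols, n_procs, t_comp, t_msg):
    """Estimated pipelined execution time for a uniform block size.

    Assumptions (not taken from the paper): every column costs t_comp on
    every node, and each block ends with one message costing t_msg.
    """
    n_blocks = math.ceil(n_cols / block_size)
    # Time for one node to compute one block and forward the boundary data.
    stage = block_size * t_comp + t_msg
    # The last node starts after n_procs - 1 stages (pipeline fill),
    # then processes all of its n_blocks blocks.
    return (n_procs - 1 + n_blocks) * stage


def best_block_size(n_cols, n_procs, t_comp, t_msg):
    """Exhaustively pick the block size that minimizes the modeled time."""
    return min(
        range(1, n_cols + 1),
        key=lambda b: pipeline_time(b, n_cols, n_procs, t_comp, t_msg),
    )
```

Under these assumptions, small blocks shorten the pipeline fill time but pay the message cost more often, while large blocks do the opposite, so the modeled optimum usually lies between the two extremes. The runtime approach in the abstract differs in that it derives its costs from measurements taken during the first two iterations and may select nonuniform block sizes.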