A compilation method for communication-efficient partitioning of DOALL loops

  • Authors:
  • Santosh Pande;Tareq Bali

  • Affiliations:
  • -;-

  • Venue:
  • Compiler optimizations for scalable parallel systems
  • Year:
  • 2001

Abstract

Because sending and receiving data incurs significant communication overhead, loop partitioning approaches for distributed memory systems must guarantee not just computation load balance but computation+communication load balance. Previous loop partitioning approaches have achieved communication-free, computation load balanced iteration space partitioning for a limited subset of DOALL loops [6]. But a large category of DOALL loops inevitably results in communication, and for those loops the tradeoffs between computation and communication must be carefully analyzed in order to balance the combined computation time and communication overhead. In this work, we describe a partitioning approach based on this motivation for the general case of DOALL loops. Our goal is to achieve a computation+communication load balanced partitioning through static data and iteration space distribution. First, the code partitioning phase analyzes the references in the body of the DOALL loop nest and determines a set of directions that eliminate a larger degree of communication at the cost of a lesser degree of parallelism. The partitioning is carried out in the iteration space of the loop by cyclically following a set of direction vectors such that data references are maximally localized and reused, eliminating a large communication volume. A new "larger partition owns" rule is formulated to minimize the communication overhead of a compute-intensive partition by localizing its references relatively more than those of a smaller, non-compute-intensive partition. A Partition Interaction Graph is then constructed and used to merge partitions to achieve granularity adjustment, computation+communication load balance, and mapping onto the actual number of available processors. Relevant theory and algorithms are developed, along with a performance evaluation on the Cray T3D.
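The abstract gives no algorithmic detail, so the following is only a minimal illustrative sketch of two of the ideas it names, not the authors' method. It assumes a 2-D loop nest, a single primitive direction vector (the paper follows a set of vectors cyclically), and a simplified reading of the "larger partition owns" rule in which a shared data element is assigned to the largest partition that references it. The function names `partition_by_direction` and `assign_owners`, the reference list `REFS`, and the arrays `A` and `B` are all hypothetical.

```python
from collections import defaultdict


def partition_by_direction(n, m, d):
    """Group iterations (i, j) of an n x m loop nest into lines along d.

    Assumption: d = (d1, d2) is a nonzero vector with gcd(d1, d2) == 1.
    Two iterations that differ by a multiple of d share the key i*d2 - j*d1.
    """
    d1, d2 = d
    parts = defaultdict(list)
    for i in range(n):
        for j in range(m):
            parts[i * d2 - j * d1].append((i, j))
    return dict(parts)


def assign_owners(parts, refs):
    """Assign each referenced element to the largest partition that touches it.

    Assumption: this is a simplified stand-in for a "larger partition owns"
    rule; ties go to the partition visited first.
    """
    owner = {}
    for pid, iters in parts.items():
        for i, j in iters:
            for ref in refs:
                elem = ref(i, j)
                cur = owner.get(elem)
                if cur is None or len(iters) > len(parts[cur]):
                    owner[elem] = pid
    return owner


# Hypothetical loop body touching A[i][j], A[i+1][j+1], and B[j].
REFS = [
    lambda i, j: ("A", i, j),
    lambda i, j: ("A", i + 1, j + 1),
    lambda i, j: ("B", j),
]

parts = partition_by_direction(3, 3, (1, 1))
owner = assign_owners(parts, REFS)
for pid in sorted(parts):
    print(pid, parts[pid])
print("owner of B[1]:", owner[("B", 1)])
```

With direction (1, 1), both references to `A` stay inside one diagonal partition, so only `B` is shared across partitions; under the simplified rule above, each `B[j]` is placed with the longest diagonal that reads it, so the partition doing the most work makes the fewest remote accesses.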