Overlap of computation and communication on shared-memory networks-of-workstations

  • Authors:
  • Tarek S. Abdelrahman; Gary Liu

  • Affiliations:
  • Department of Electrical and Computer Engineering, The University of Toronto, Toronto, Ontario, Canada; IBM Canada, Ltd, Toronto, Ontario, Canada

  • Venue:
  • Cluster computing
  • Year:
  • 2001

Abstract

This paper describes and evaluates a compiler transformation that improves the performance of parallel programs on Network-of-Workstations (NOW) shared-memory multiprocessors. The transformation overlaps the communication time resulting from nonlocal memory accesses with the computation time in parallel loops, effectively hiding the latency of the remote accesses. It peels from a parallel loop the iterations that access remote data and reschedules them after the execution of the iterations that access only local data (local-only iterations). Asynchronous prefetching of remote data is used to overlap nonlocal access latency with the execution of local-only iterations. Experimental evaluation of the transformation on a NOW multiprocessor indicates that it is generally effective in improving parallel execution time (by up to a factor of 1.9). The extent of the benefit is determined by three factors: the size of the local-only computations, the significance of remote memory access latency, and the position of the iterations that access remote data in a parallel loop.
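
The peel-and-reschedule idea can be illustrated with a small sketch. This is not the authors' compiler output. It assumes a 3-point stencil over one processor's block `[lo, hi)`, in which only the first and last iterations read elements owned by a neighbour. The function names, the stencil, and the use of a thread pool as a stand-in for asynchronous prefetching are all hypothetical and chosen only to show the ordering of the steps.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_remote(a, indices):
    """Hypothetical stand-in for an asynchronous prefetch of remote elements."""
    return {i: a[i] for i in indices}


def original_loop(a, lo, hi):
    """Reference loop: 3-point stencil over this processor's block [lo, hi)."""
    return [a[i - 1] + a[i] + a[i + 1] for i in range(lo, hi)]


def transformed_loop(a, lo, hi):
    """Peeled loop: iterations that need remote data run after the local-only ones.

    Assumes 1 <= lo < hi < len(a); a[lo - 1] and a[hi] are treated as remote.
    """
    out = {}
    with ThreadPoolExecutor(max_workers=1) as pool:
        # 1. Issue the prefetch for the two remote neighbour elements.
        pending = pool.submit(fetch_remote, a, [lo - 1, hi])
        # 2. Execute the local-only iterations while the prefetch is in flight.
        for i in range(lo + 1, hi - 1):
            out[i] = a[i - 1] + a[i] + a[i + 1]
        # 3. Wait for the remote data to arrive.
        halo = pending.result()
    # 4. Execute the peeled iterations using the prefetched values.
    for i in sorted({lo, hi - 1}):
        left = halo[i - 1] if i == lo else a[i - 1]
        right = halo[i + 1] if i == hi - 1 else a[i + 1]
        out[i] = left + a[i] + right
    return [out[i] for i in range(lo, hi)]


if __name__ == "__main__":
    data = list(range(10))
    print(transformed_loop(data, 3, 7) == original_loop(data, 3, 7))
```

In this sketch the benefit would depend on how much work step 2 contains relative to the fetch time, which mirrors the first two factors named in the abstract. The reordering is valid here only because the iterations are independent, as they are in a parallel loop.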