Overlap of computation and communication on shared-memory networks-of-workstations

Authors:
Tarek S. Abdelrahman;Gary Liu
Affiliations:
Department of Electrical and Computer Engineering, The University of Toronto, Toronto, Ontario, Canada;IBM Canada, Ltd, Toronto, Ontario, Canada
Venue:
Cluster Computing
Year:
2001

Abstract

This paper describes and evaluates a compiler transformation that improves the performance of parallel programs on Network-of-Workstations (NOW) shared-memory multiprocessors. The transformation overlaps the communication time resulting from nonlocal memory accesses with the computation time in parallel loops, effectively hiding the latency of the remote accesses. It peels the iterations that access remote data from a parallel loop and reschedules them after the iterations that access only local data (local-only iterations). Asynchronous prefetching of the remote data overlaps the nonlocal access latency with the execution of the local-only iterations. Experimental evaluation on a NOW multiprocessor indicates that the transformation is generally effective, improving parallel execution time by up to a factor of 1.9. The extent of the benefit is determined by three factors: the size of the local-only computations, the significance of the remote memory access latency, and the position of the iterations that access remote data in a parallel loop.
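
The peel-and-reschedule idea can be illustrated with a small hand-written sketch. It is not the authors' compiler output. It assumes a three-point stencil over a block-partitioned array in which one processor owns a[lo:hi], so that only the first and last iterations touch remote elements. The names fetch_remote, stencil_original and stencil_overlapped are hypothetical, and a Python thread pool merely stands in for an asynchronous prefetch.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_remote(a, indices):
    """Hypothetical stand-in for a remote fetch: copy the requested elements."""
    return {i: a[i] for i in indices}


def stencil_original(a, lo, hi):
    """Untransformed loop: every iteration runs in order, remote or not."""
    return [a[i - 1] + a[i] + a[i + 1] for i in range(lo, hi)]


def stencil_overlapped(a, lo, hi, pool):
    """Transformed loop. Assumes 1 <= lo, hi <= len(a) - 1 and hi - lo >= 2.

    The processor is assumed to own a[lo:hi]; a[lo - 1] and a[hi] are
    treated as remote, so iterations lo and hi - 1 are the ones peeled.
    """
    out = [None] * (hi - lo)

    # 1. Start the asynchronous prefetch of the remote elements.
    pending = pool.submit(fetch_remote, a, [lo - 1, hi])

    # 2. Run the local-only iterations while the prefetch is in flight.
    for i in range(lo + 1, hi - 1):
        out[i - lo] = a[i - 1] + a[i] + a[i + 1]

    # 3. Wait for the prefetched data, then run the peeled iterations.
    halo = pending.result()
    out[0] = halo[lo - 1] + a[lo] + a[lo + 1]
    out[hi - 1 - lo] = a[hi - 2] + a[hi - 1] + halo[hi]
    return out


if __name__ == "__main__":
    data = list(range(10))
    with ThreadPoolExecutor(max_workers=1) as pool:
        print(stencil_overlapped(data, 3, 7, pool))
    print(stencil_original(data, 3, 7))
```

In this sketch the remote latency is hidden only to the extent that step 2 takes at least as long as the fetch, which mirrors the first two factors named in the abstract: the size of the local-only computation and the significance of the remote access latency.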