Journal of Parallel and Distributed Computing - Special issue: software tools for parallel programming and visualization
This paper describes and evaluates a compiler transformation that improves the performance of parallel programs on Network-of-Workstations (NOW) shared-memory multiprocessors. The transformation overlaps the communication time resulting from nonlocal memory accesses with the computation time in parallel loops, effectively hiding the latency of the remote accesses. It peels from a parallel loop the iterations that access remote data and reschedules them after the iterations that access only local data (local-only iterations). Asynchronous prefetching of the remote data overlaps the nonlocal access latency with the execution of the local-only iterations. Experimental evaluation on a NOW multiprocessor indicates that the transformation is generally effective, improving parallel execution time by a factor of up to 1.9. The extent of the benefit is determined by three factors: the size of the local-only computations, the significance of the remote memory access latency, and the position within the parallel loop of the iterations that access remote data.
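The peel-and-reschedule idea can be illustrated with a small sketch. It is not the paper's compiler implementation: it assumes a three-point stencil over a block of a shared array owned by one processor, and it uses a Python thread as a stand-in for asynchronous prefetching. The names `fetch_remote`, `stencil_original`, and `stencil_overlapped` are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_remote(global_a, index):
    """Hypothetical stand-in for a remote memory access.

    On a real NOW system this would involve network communication.
    """
    return global_a[index]


def stencil_original(global_a, lo, hi):
    """Original loop over the owned block [lo, hi).

    The boundary iterations read neighbours outside the block, so they
    would stall on remote accesses. Assumes 1 <= lo < hi <= len(global_a) - 1.
    """
    return [global_a[i - 1] + global_a[i] + global_a[i + 1] for i in range(lo, hi)]


def stencil_overlapped(global_a, lo, hi):
    """Transformed loop: prefetch remote data, run local-only iterations
    first, then run the peeled iterations that need the remote data."""
    local = {i: global_a[i] for i in range(lo, hi)}  # data owned locally

    def needs_remote(i):
        return i - 1 < lo or i + 1 >= hi

    peeled = [i for i in range(lo, hi) if needs_remote(i)]
    local_only = [i for i in range(lo, hi) if not needs_remote(i)]
    remote_idx = sorted(
        {j for i in peeled for j in (i - 1, i + 1) if not lo <= j < hi}
    )

    result = {}
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Issue the asynchronous prefetches before any computation starts.
        pending = {j: pool.submit(fetch_remote, global_a, j) for j in remote_idx}

        # Local-only iterations execute while the prefetches are in flight.
        for i in local_only:
            result[i] = local[i - 1] + local[i] + local[i + 1]

        # Wait for the prefetched values, then run the peeled iterations.
        cache = dict(local)
        for j, future in pending.items():
            cache[j] = future.result()
        for i in peeled:
            result[i] = cache[i - 1] + cache[i] + cache[i + 1]

    return [result[i] for i in range(lo, hi)]
```

In this sketch the two boundary iterations are the only ones peeled, so the overlap is large when the block is large and shrinks as the block gets smaller, which mirrors the abstract's point that the benefit depends on the size of the local-only computation. Reordering is safe here only because the iterations of the parallel loop are independent.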