Compiling Fortran D for MIMD distributed-memory machines
Communications of the ACM
LogP: towards a realistic model of parallel computation
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
An HPF compiler for the IBM SP2
Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
LogGP: incorporating long messages into the LogP model for parallel computation
Journal of Parallel and Distributed Computing
Data Locality Exploitation in the Decomposition of Regular Domain Problems
IEEE Transactions on Parallel and Distributed Systems
Compile-Time Estimation of Communication Costs on Multicomputers
IPPS '92 Proceedings of the 6th International Parallel Processing Symposium
A Compilation Approach for Fortran 90D/ HPF Compilers
Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing
A New Approach to Array Redistribution: Strip Mining Redistribution
PARLE '94 Proceedings of the 6th International PARLE Conference on Parallel Architectures and Languages Europe
The Design for a High-Performance MPI Implementation on the Myrinet Network
Proceedings of the 6th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
An Evaluation of Current High-Performance Networks
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
GASNet Specification, v1.1
Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques
Productivity and performance using partitioned global address space languages
Proceedings of the 2007 international workshop on Parallel symbolic computation
Optimizing the use of static buffers for DMA on a CELL chip
LCPC'06 Proceedings of the 19th international conference on Languages and compilers for parallel computing
Optimizing bandwidth limited problems using one-sided communication and overlap
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Hi-index | 0.00 |
In this work we investigate how the compiler technique of message strip-mining performs in practice on contemporary high performance networks. Message strip-mining attempts to reduce the overall cost of communication in parallel programs by breaking up large message transfers into smaller ones that can be overlapped with computation. In practice, however, network resource constraints may negate the expected performance gains. By deriving a performance model and synthetic benchmarks we determine how network and application characteristics in.uence the applicability of this optimization. We use these .ndings to determine heuristics to follow when performing this optimization on parallel programs. We propose strip-mining with variable block size as an alternative strategy that performs almost as well as a highly tuned .xed block strategy and has the advantage of being performance portable across systems and application input sets. We evaluate both techniques using synthetic benchmarks and an application from the NAS Parallel Benchmark suite.