Performance portable optimizations for loops containing communication operations
Proceedings of the 22nd annual international conference on Supercomputing
Modern networking hardware supports true non-blocking communication, and effective exploitation of this feature can lead to significant application performance improvements. We believe that algorithm design and optimization techniques that hide latency through communication overlap will help achieve good parallel efficiency and performance on highly concurrent contemporary systems. Finding an optimal, performance-portable implementation with non-blocking communication primitives is non-trivial and intimidating to many application developers. In this paper we present a methodology for discovering optimal message sizes and schedules for a variety of application scenarios. The methodology combines an analytic model, which accounts for the variation of performance parameters with system scale and load, with heuristics designed to avoid network congestion. We perform experiments to understand network behavior in the presence of overlap and prune the optimization space for any system based on either resource or implementation constraints. Our approach is able to choose optimal or nearly optimal implementation parameters for a variety of highly non-trivial scenarios and networks with different performance characteristics. Implementations based on parameters chosen by the models are able to hide over 90% of communication overhead in all cases.
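To make the idea of model-driven message-size selection concrete, the following is a minimal sketch and not the model presented in the paper. It assumes a simplified LogGP-style cost model with hypothetical parameters `o` (per-message CPU overhead), `L` (latency) and `G` (time per byte), and it ignores the scale- and load-dependent variation and the congestion heuristics described above. The function names `estimate_time` and `choose_chunk` are illustrative only.

```python
import math


def estimate_time(total_bytes, chunk_bytes, compute_s, o, L, G):
    """Estimate the completion time of a pipelined, overlapped transfer.

    Assumed (hypothetical) model: the CPU produces data chunk by chunk,
    pays overhead `o` to start each non-blocking send, and the network
    moves one chunk at a time at `G` seconds per byte plus latency `L`.
    """
    n_chunks = math.ceil(total_bytes / chunk_bytes)
    cpu = 0.0        # time at which the CPU finishes initiating a send
    net_free = 0.0   # time at which the network can accept the next chunk
    done = 0.0       # arrival time of the most recent chunk
    remaining = total_bytes
    for _ in range(n_chunks):
        size = min(chunk_bytes, remaining)
        remaining -= size
        # Compute this chunk's share of the work, then initiate the send.
        cpu += compute_s * size / total_bytes + o
        # The transfer starts once the data is ready and the link is free.
        start = max(cpu, net_free)
        net_free = start + size * G
        done = net_free + L
    return max(cpu, done)


def choose_chunk(total_bytes, candidates, compute_s, o, L, G):
    """Return the candidate chunk size with the lowest estimated time."""
    return min(
        candidates,
        key=lambda c: estimate_time(total_bytes, c, compute_s, o, L, G),
    )


if __name__ == "__main__":
    # Illustrative values only; real parameters must be measured.
    params = dict(compute_s=1e-3, o=5e-6, L=2e-6, G=1e-9)
    sizes = [1024, 16384, 65536, 262144, 1048576]
    best = choose_chunk(1 << 20, sizes, **params)
    print(best, estimate_time(1 << 20, best, **params))
```

In this sketch, very small chunks lose time to per-message overhead while very large chunks leave little computation to overlap with, so an intermediate size minimizes the estimate. A realistic search would also have to respect the resource and implementation constraints mentioned in the abstract.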