This paper describes the design and performance evaluation of mechanisms for latency tolerance in remote memory access (RMA) communication on clusters equipped with high-performance networks such as Myrinet. It discusses strategies that bridge the gap between user-level requirements and network-specific communication interfaces while attempting to increase opportunities for latency hiding. Two mechanisms are explored: overlapping communication with computation, and coalescing small messages (trading latency for bandwidth). The effectiveness of these techniques is evaluated using microbenchmarks and application kernels, including the NAS parallel benchmark suite. The microbenchmark results showed a much better degree of overlap for non-blocking operations in ARMCI than in MPI. Application results showed improvements of up to 30-45% over MPI when using non-blocking operations. Aggregating small messages yielded performance improvements of up to 78% over non-aggregated communication.
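The two mechanisms named in the abstract can be illustrated with a simple latency/bandwidth cost model. The sketch below is not taken from the paper and does not use ARMCI or MPI: the `Link` class, the helper functions, and the latency and bandwidth figures are all hypothetical, chosen only to show why coalescing small messages amortizes the fixed per-message cost and why a non-blocking transfer can hide communication time behind computation.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical network link; both parameters are assumed values."""
    latency_us: float      # fixed cost paid once per message, in microseconds
    bandwidth_mb_s: float  # sustained bandwidth in MB/s (1 MB/s == 1 byte/us)

    def transfer_us(self, nbytes: int) -> float:
        # Time for one message: fixed latency plus size divided by bandwidth.
        return self.latency_us + nbytes / self.bandwidth_mb_s


def unaggregated_us(link: Link, sizes: list[int]) -> float:
    # Every small message pays the per-message latency separately.
    return sum(link.transfer_us(n) for n in sizes)


def aggregated_us(link: Link, sizes: list[int]) -> float:
    # Messages are coalesced into one buffer, so latency is paid once.
    return link.transfer_us(sum(sizes))


def blocking_us(comm_us: float, compute_us: float) -> float:
    # Blocking transfer: computation starts only after communication ends.
    return comm_us + compute_us


def overlapped_us(comm_us: float, compute_us: float,
                  overhead_us: float = 0.0) -> float:
    # Idealized non-blocking transfer: after an issue overhead on the host,
    # communication and computation proceed concurrently.
    return overhead_us + max(comm_us, compute_us)


if __name__ == "__main__":
    link = Link(latency_us=10.0, bandwidth_mb_s=200.0)  # assumed figures
    sizes = [64] * 100                                  # 100 messages of 64 bytes

    print("unaggregated:", unaggregated_us(link, sizes), "us")
    print("aggregated:  ", aggregated_us(link, sizes), "us")

    comm = link.transfer_us(1_000_000)                  # one 1 MB transfer
    print("blocking:    ", blocking_us(comm, 4000.0), "us")
    print("overlapped:  ", overlapped_us(comm, 4000.0, overhead_us=5.0), "us")
```

The model ignores host overhead during the transfer, contention, and the extra delay an individual message incurs while it waits in the aggregation buffer, so it gives an upper bound on the benefit rather than a prediction of the figures reported above.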