Optimisation and performance evaluation of mechanisms for latency tolerance in remote memory access communication on clusters

Authors:
J. Nieplocha; V. Tipparaju; M. Krishnan; G. Santhanaraman; D. K. Panda
Affiliations:
Pacific Northwest National Laboratory, Richland, WA 99352, USA; Pacific Northwest National Laboratory, Richland, WA 99352, USA; Pacific Northwest National Laboratory, Richland, WA 99352, USA; Ohio State University, Columbus, OH 43210, USA; Ohio State University, Columbus, OH 43210, USA
Venue:
International Journal of High Performance Computing and Networking
Year:
2004

Abstract

This paper describes the design and performance evaluation of mechanisms for latency tolerance in remote memory access communication on clusters equipped with high-performance networks such as Myrinet. It discusses strategies that bridge the gap between user-level requirements and network-specific communication interfaces while increasing opportunities for latency hiding. Two mechanisms are explored: overlapping communication with computation, and coalescing small messages (trading latency for bandwidth). Their effectiveness is evaluated using microbenchmarks and application kernels, including the NAS parallel benchmark suite. The microbenchmark results show a much better degree of overlap for non-blocking operations in ARMCI than in MPI. Application results show improvements of up to 30-45% over MPI when non-blocking operations are used. Aggregating small messages yields performance improvements of up to 78% over non-aggregated communication.
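
The two mechanisms named in the abstract can be illustrated with a simple cost model. The sketch below is not taken from the paper: it assumes a linear model in which each message costs a fixed latency plus its size divided by the bandwidth, and it assumes ideal overlap in which communication and computation proceed fully in parallel. The function names and the parameter values in the usage note are hypothetical and chosen only for illustration; they are not measurements reported by the authors.

```python
def separate_time(num_messages, bytes_per_message, latency_s, bandwidth_bps):
    """Time to send the messages one by one (assumed linear cost model).

    Every message pays the fixed latency plus its own transfer time.
    bandwidth_bps is in bytes per second.
    """
    per_message = latency_s + bytes_per_message / bandwidth_bps
    return num_messages * per_message


def aggregated_time(num_messages, bytes_per_message, latency_s, bandwidth_bps):
    """Time to send the same data coalesced into one message.

    The latency is paid once; the total payload is unchanged.
    """
    total_bytes = num_messages * bytes_per_message
    return latency_s + total_bytes / bandwidth_bps


def blocking_time(comm_s, compute_s):
    """Blocking communication: transfer and computation run back to back."""
    return comm_s + compute_s


def overlapped_time(comm_s, compute_s):
    """Non-blocking communication under the assumption of ideal overlap:
    the longer of the two activities determines the elapsed time."""
    return max(comm_s, compute_s)
```

For example, with an assumed latency of 10 microseconds and an assumed bandwidth of 200 MB/s, `separate_time(100, 64, 10e-6, 200e6)` is dominated by the 100 latency terms, whereas `aggregated_time(100, 64, 10e-6, 200e6)` pays the latency only once. Real networks deviate from this model (for instance through protocol switches at certain message sizes), so it indicates the direction of the effect rather than its size.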