Optimizing Strided Remote Memory Access Operations on the Quadrics QsNetII Network Interconnect

Authors:
Jarek Nieplocha; Vinod Tipparaju; Manoj Krishnan
Affiliations:
Pacific Northwest National Laboratory; Pacific Northwest National Laboratory; Pacific Northwest National Laboratory
Venue:
HPCASIA '05 Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region
Year:
2005

Abstract

This paper describes and evaluates protocols for optimizing strided non-contiguous communication on the Quadrics QsNetII high-performance network interconnect. Most previous related studies focused primarily on either NIC-based or host-based protocols. This paper discusses the merits of both approaches and tries to determine the types and data sizes of communication operations for which each protocol should be used. We focus on the Quadrics QsNetII network, which offers powerful communication processors on the network interface card (NIC) and practical and flexible opportunities for exploiting them in the context of the user. Furthermore, the paper focuses on non-contiguous remote memory access (RMA) data transfers and performs the evaluation in the context of standalone communication and application microbenchmarks. In comparison to the vendor-provided non-contiguous interfaces, the proposed approach achieved significant performance improvements in microbenchmarks as well as in application kernels: dense matrix multiplication and the Co-Array Fortran version of the NAS BT parallel benchmark. For example, for 64 processes, a 54% improvement in overall communication time for NAS BT Class B and a 42% improvement in matrix multiplication were achieved.
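The abstract does not spell out the protocols themselves. As a rough illustration of what a strided transfer involves, the following C sketch contrasts two generic ways of moving a strided patch (`count` chunks of `chunk_bytes` each, separated by a stride): issuing one contiguous put per chunk, or packing the chunks into a staging buffer and issuing a single put. This is an assumption-laden sketch, not the paper's implementation and not the Quadrics API: `contig_put`, `strided_put_chunked`, `strided_put_packed`, and `put_calls` are hypothetical names, and `memcpy` stands in for the actual network transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a contiguous remote put. A real RMA layer would
 * move the bytes over the network; memcpy is used here only for illustration. */
static size_t put_calls = 0;

static void contig_put(void *remote_dst, const void *local_src, size_t bytes)
{
    memcpy(remote_dst, local_src, bytes);
    put_calls++;
}

/* Chunked variant: one contiguous put per chunk of the strided patch.
 * No extra copies, but `count` separate transfers are issued. */
void strided_put_chunked(char *dst, size_t dst_stride,
                         const char *src, size_t src_stride,
                         size_t chunk_bytes, size_t count)
{
    assert(src_stride >= chunk_bytes && dst_stride >= chunk_bytes);
    for (size_t i = 0; i < count; i++)
        contig_put(dst + i * dst_stride, src + i * src_stride, chunk_bytes);
}

/* Packed variant: gather the chunks into a contiguous staging buffer,
 * issue a single put, then scatter on the target side.
 * send_buf and recv_buf must each hold chunk_bytes * count bytes. */
void strided_put_packed(char *dst, size_t dst_stride,
                        const char *src, size_t src_stride,
                        size_t chunk_bytes, size_t count,
                        char *send_buf, char *recv_buf)
{
    assert(src_stride >= chunk_bytes && dst_stride >= chunk_bytes);

    /* Pack on the origin side. */
    for (size_t i = 0; i < count; i++)
        memcpy(send_buf + i * chunk_bytes, src + i * src_stride, chunk_bytes);

    /* One transfer carries the whole patch. */
    contig_put(recv_buf, send_buf, chunk_bytes * count);

    /* Unpack on the target side. */
    for (size_t i = 0; i < count; i++)
        memcpy(dst + i * dst_stride, recv_buf + i * chunk_bytes, chunk_bytes);
}
```

Both variants leave the destination in the same state. They differ in the number of transfers issued and in the amount of extra copying, which is the kind of trade-off that may depend on chunk size and count.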