ConnectX-2 InfiniBand Management Queues: First Investigation of the New Support for Network Offloaded Collective Operations

Authors:
Richard L. Graham;Steve Poole;Pavel Shamis;Gil Bloch;Noam Bloch;Hillel Chapman;Michael Kagan;Ariel Shahar;Ishai Rabinovitz;Gilad Shainer
Venue:
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Year:
2010

Citing 7
Cited 9

Reducing the variance of point to point transfers in the IBM 9076 parallel computer

Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Efficient Multicast on Myrinet using Link-Level Flow Control

ICPP '98 Proceedings of the 1998 International Conference on Parallel Processing
Fast NIC-Based Barrier over Myrinet/GM

IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
(R) Efficient Reliable Multicast on MYRINET

ICPP '96 Proceedings of the 1996 International Conference on Parallel Processing - Volume 3
The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Scalable NIC-based Reduction on Large-scale Clusters

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
MPI Support for Multi-core Architectures: Optimized Shared Memory Collectives

Proceedings of the 15th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface

Network offloaded hierarchical collectives using ConnectX-2's CORE-Direct capabilities

EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
High-performance and scalable non-blocking all-to-all with collective offload on InfiniBand clusters: a study with parallel 3D FFT

Computer Science - Research and Development
Kernel-based offload of collective operations: implementation, evaluation and lessons learned

Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part II
Design and Implementation of Portable and Efficient Non-blocking Collective Communication

CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Runtime detection and optimization of collective communication patterns

Proceedings of the 21st international conference on Parallel architectures and compilation techniques
Assessing the performance and scalability of a novel multilevel k-nomial allgather on CORE-Direct systems

Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Improving MPI communication overlap with collaborative polling

EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface
The co-design architecture for exascale systems, a novel approach for scalable designs

Computer Science - Research and Development
Improving MPI communication overlap with collaborative polling

Computing

Abstract

This paper introduces the newly developed InfiniBand (IB) Management Queue capability, which the Host Channel Adapter (HCA) uses to manage the data flow dependencies of network tasks and to progress the communications associated with such flows. These tasks include sends, receives, and the newly supported wait task, and the HCA schedules them based on a data dependency description provided by the user. This functionality is supported by the ConnectX-2 HCA and provides the means for delegating collective communication management and progress to the HCA, also known as collective communication offload. Offload makes it possible to overlap collective communications managed by the HCA with computation on the Central Processing Unit (CPU), and thus to reduce the impact of system noise on parallel applications that use collective operations. This paper further describes how the new capability can be used to implement scalable Message Passing Interface (MPI) collective operations. It gives the high-level details of an implementation of the MPI Barrier collective operation and focuses on the latency-sensitive performance aspects of the capability. The paper concludes with small-scale benchmark experiments comparing implementations of the barrier collective operation that use the new network offload capabilities with established point-to-point implementations of the same algorithms, which manage the data flow on the CPU. These early results demonstrate the promise of this capability for improving the scalability of high-performance applications that use collective communications. The latency of the HCA-based barrier is similar to that of the best-performing CPU-managed point-to-point implementation, and the HCA-based barrier starts to outperform the CPU-managed implementations as the number of processes involved in the collective operation increases.
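To make the idea of a user-supplied dependency description concrete, the following is a minimal illustrative sketch in Python, not the ConnectX-2 interface or the paper's implementation. It assumes a recursive-doubling barrier over a power-of-two number of processes, models each rank's queue as an ordered list of "send" and "wait" tasks, and uses a small simulator in place of the HCA. The names barrier_task_list and simulate are hypothetical and introduced only for this example.

```python
from collections import defaultdict, deque


def barrier_task_list(rank, nprocs):
    """Build one rank's ordered task list for a recursive-doubling barrier.

    Assumption: nprocs is a power of two. Each round posts a send to the
    peer and then a wait that blocks the queue until the peer's message
    has arrived, so later rounds depend on earlier ones.
    """
    assert nprocs > 0 and nprocs & (nprocs - 1) == 0, "power of two only"
    tasks = []
    dist = 1
    while dist < nprocs:
        peer = rank ^ dist
        tasks.append(("send", peer))  # notify the peer for this round
        tasks.append(("wait", peer))  # hold the queue until the peer notifies us
        dist <<= 1
    return tasks


def simulate(nprocs):
    """Progress all queues without any per-step decision by the 'CPU'.

    Stand-in for the adapter: it repeatedly sweeps over the ranks and
    executes tasks at the head of each queue while their dependency is
    satisfied. Returns the number of sweeps needed to drain every queue.
    """
    queues = [deque(barrier_task_list(r, nprocs)) for r in range(nprocs)]
    arrived = defaultdict(int)  # (destination, source) -> pending messages
    sweeps = 0
    while any(queues):
        sweeps += 1
        progressed = False
        for rank, queue in enumerate(queues):
            while queue:
                kind, peer = queue[0]
                if kind == "send":
                    arrived[(peer, rank)] += 1
                elif arrived[(rank, peer)] > 0:
                    arrived[(rank, peer)] -= 1  # wait satisfied, consume message
                else:
                    break  # wait not yet satisfied, queue stays blocked
                queue.popleft()
                progressed = True
        if not progressed:
            raise RuntimeError("task lists deadlocked")
    return sweeps


if __name__ == "__main__":
    print(barrier_task_list(0, 4))
    print("sweeps for 8 ranks:", simulate(8))
```

In this sketch the host only builds the task list once; everything after that is driven by message arrival, which is the property that allows the barrier to progress while the CPU computes.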