Network offloaded hierarchical collectives using ConnectX-2's CORE-Direct capabilities

Authors:
Ishai Rabinovitz;Pavel Shamis;Richard L. Graham;Noam Bloch;Gilad Shainer
Affiliations:
Mellanox Technologies, Inc.;Mellanox Technologies, Inc.;Oak Ridge National Laboratory;Mellanox Technologies, Inc.;Mellanox Technologies, Inc.
Venue:
EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
Year:
2010

Citing 15

MagPIe: MPI's collective communication operations for clustered wide area systems
Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming

A network-failure-tolerant message-passing system for terascale clusters
ICS '02 Proceedings of the 16th international conference on Supercomputing

Reducing the variance of point to point transfers in the IBM 9076 parallel computer
Proceedings of the 1994 ACM/IEEE conference on Supercomputing

Efficient Multicast on Myrinet using Link-Level Flow Control
ICPP '98 Proceedings of the 1998 International Conference on Parallel Processing

Fast NIC-Based Barrier over Myrinet/GM
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium

COMB: A Portable Benchmark Suite for Assessing MPI Overlap
CLUSTER '02 Proceedings of the IEEE International Conference on Cluster Computing

Fast Collective Operations Using Shared and Remote Memory Access Protocols on Clusters
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing

Scalable NIC-based Reduction on Large-scale Clusters
Proceedings of the 2003 ACM/IEEE conference on Supercomputing

Implementation and performance analysis of non-blocking collective operations for MPI
Proceedings of the 2007 ACM/IEEE conference on Supercomputing

The deep computing messaging framework: generalized scalable message passing on the blue gene/P supercomputer
Proceedings of the 22nd annual international conference on Supercomputing

A framework for adaptive collective communications for heterogeneous hierarchical computing systems
Journal of Computer and System Sciences

Efficient offloading of collective communications in large-scale systems
CLUSTER '07 Proceedings of the 2007 IEEE International Conference on Cluster Computing

Hierarchical Collectives in MPICH2
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface

The International Exascale Software Project: a Call To Cooperative Action By the Global High-Performance Community
International Journal of High Performance Computing Applications

ConnectX-2 InfiniBand Management Queues: First Investigation of the New Support for Network Offloaded Collective Operations
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing

Cited 1

The co-design architecture for exascale systems, a novel approach for scalable designs
Computer Science - Research and Development

Abstract

As the scale of High Performance Computing (HPC) systems continues to increase, demanding that we extract even more parallelism from applications, the need to move communication management away from the Central Processing Unit (CPU) becomes even greater. Moving this management to the network frees up CPU cycles for computation, making it possible to overlap computation and communication. In this paper we continue to investigate how best to use the new CORE-Direct support added in the ConnectX-2 Host Channel Adapter (HCA) to create high-performance, asynchronous collective operations that are managed by the HCA. Specifically, we take the network topology into account by creating a two-level communication hierarchy. This reduces the MPI_Barrier completion time by 45%, from 26.59 microseconds when network topology is not considered to 14.72 microseconds; the CPU-based collective barrier operation completes in 19.04 microseconds. The nonblocking barrier algorithm has similar performance, with about 50% of that time available for computation.
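
The two-level hierarchy mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the paper's CORE-Direct implementation and does not use MPI or the HCA: Python threads stand in for MPI ranks, and the function name `two_level_barrier_demo`, its parameters, and the choice of local rank 0 as node leader are all assumptions made for illustration. It only shows the general shape of a hierarchical barrier: ranks first synchronize within their node, one leader per node synchronizes across nodes, and the leaders then release their local ranks.

```python
import threading


def two_level_barrier_demo(num_nodes=3, ranks_per_node=4):
    """Simulate a two-level barrier; threads stand in for MPI ranks (illustrative only)."""
    # One reusable barrier per node for its local ranks, plus one for the node leaders.
    local = [threading.Barrier(ranks_per_node) for _ in range(num_nodes)]
    leaders = threading.Barrier(num_nodes)
    lock = threading.Lock()
    arrived = 0          # number of ranks that have entered the barrier so far
    seen_on_exit = []    # value of `arrived` observed by each rank when it leaves

    def rank(node, local_rank):
        nonlocal arrived
        with lock:
            arrived += 1
        local[node].wait()        # phase 1: fan-in within the node
        if local_rank == 0:       # assumption: local rank 0 acts as node leader
            leaders.wait()        # phase 2: leaders synchronize across nodes
        local[node].wait()        # phase 3: fan-out, leader releases its node
        with lock:
            seen_on_exit.append(arrived)

    threads = [
        threading.Thread(target=rank, args=(n, r))
        for n in range(num_nodes)
        for r in range(ranks_per_node)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_on_exit


if __name__ == "__main__":
    # Every rank should observe that all ranks had entered before it left.
    print(two_level_barrier_demo())
```

In this sketch only one rank per node takes part in the cross-node phase, which is the property a topology-aware hierarchy exploits; how the paper maps these phases onto HCA-managed operations is not described in the abstract and is not modeled here.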