MagPIe: MPI's collective communication operations for clustered wide area systems
Proceedings of the seventh ACM SIGPLAN symposium on Principles and practice of parallel programming
A network-failure-tolerant message-passing system for terascale clusters
ICS '02 Proceedings of the 16th international conference on Supercomputing
Reducing the variance of point to point transfers in the IBM 9076 parallel computer
Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Efficient Multicast on Myrinet using Link-Level Flow Control
ICPP '98 Proceedings of the 1998 International Conference on Parallel Processing
Fast NIC-Based Barrier over Myrinet/GM
IPDPS '01 Proceedings of the 15th International Parallel & Distributed Processing Symposium
COMB: A Portable Benchmark Suite for Assessing MPI Overlap
CLUSTER '02 Proceedings of the IEEE International Conference on Cluster Computing
Fast Collective Operations Using Shared and Remote Memory Access Protocols on Clusters
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Scalable NIC-based Reduction on Large-scale Clusters
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Implementation and performance analysis of non-blocking collective operations for MPI
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Proceedings of the 22nd annual international conference on Supercomputing
A framework for adaptive collective communications for heterogeneous hierarchical computing systems
Journal of Computer and System Sciences
Efficient offloading of collective communications in large-scale systems
CLUSTER '07 Proceedings of the 2007 IEEE International Conference on Cluster Computing
Hierarchical Collectives in MPICH2
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
International Journal of High Performance Computing Applications
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
The co-design architecture for exascale systems, a novel approach for scalable designs
Computer Science - Research and Development
Hi-index | 0.00 |
As the scale of High Performance Computing (HPC) systems continues to increase, demanding that we extract even more parallelism from applications, the need to move communication management away from the Central Processing Unit (CPU) becomes even greater. Moving this management to the network, frees up CPU cycles for computation, making it possible to overlap computation and communication. In this paper we continue to investigate how to best use the new CORE-Direct support added in the ConnectX-2 Host Channel Adapter (HCA) for creating high performance, asynchronous collective operations that are managed by the HCA. Specifically we consider the network topology, creating a two-level communication hierarchy, reducing the MPI Barrier completion time by 45%, from 26.59 microseconds, when not considering network topology, to 14.72 microseconds, with the CPU based collective barrier operation completing in 19.04 microseconds. The nonblocking barrier algorithm has similar performance, with about 50% of that time available for computation.