LogP: towards a realistic model of parallel computation
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
The communication challenge for MPP: Intel Paragon and Meiko CS-2
Parallel Computing
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems
IEEE Transactions on Parallel and Distributed Systems
Optimization of MPI collectives on clusters of large-scale SMP's
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Automatically tuned collective communications
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Multiphase Complete Exchange on Paragon, SP2, and CS-2
IEEE Parallel & Distributed Technology: Systems & Technology
Message passing and shared address space parallelism on an SMP cluster
Parallel Computing
Fast Collective Operations Using Shared and Remote Memory Access Protocols on Clusters
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Efficient and Scalable All-to-All Personalized Exchange for InfiniBand-Based Clusters
ICPP '04 Proceedings of the 2004 International Conference on Parallel Processing
Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Performance Analysis of MPI Collective Operations
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 15 - Volume 16
LiMIC: Support for High-Performance MPI Intra-node Communication on Linux Cluster
ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
Optimizing Collective Communications on SMP Clusters
ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
Optimizing All-to-All Collective Communication by Exploiting Concurrency in Modern Networks
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Data Transfers between Processes in an SMP System: Performance Study and Application to MPI
ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
HOTI '07 Proceedings of the 15th Annual IEEE Symposium on High-Performance Interconnects
RDMA-based and SMP-aware Multi-port All-Gather on Multi-rail QsNet^II SMP Clusters
ICPP '07 Proceedings of the 2007 International Conference on Parallel Processing
Collective operations in NEC's high-performance MPI libraries
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Efficient allgather for regular SMP-Clusters
EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
Efficient shared memory and RDMA based design for MPI_Allgather over InfiniBand
EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
High performance RDMA based all-to-all broadcast for InfiniBand clusters
HiPC'05 Proceedings of the 12th international conference on High Performance Computing
Optimised gather collectives on QsNetII
PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Process Arrival Pattern and Shared Memory Aware Alltoall on InfiniBand
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Clusters of Symmetric Multiprocessors (SMPs) are more commonplace than ever in high-performance computing, and the scientific applications that run on them make extensive use of collective communications. Shared-memory communication and Remote Direct Memory Access (RDMA) over multi-rail networks are promising approaches to meeting the growing demand for intra-node and inter-node bandwidth, and thereby to boosting the performance of collectives on emerging multi-core SMP clusters. To this end, this paper designs and evaluates two classes of collective communication algorithms directly at the Elan user level over multi-rail Quadrics QsNetII with message striping: 1) RDMA-based traditional multi-port algorithms for the gather, all-gather, and all-to-all collectives for medium to large messages, and 2) RDMA-based and SMP-aware multi-port all-gather algorithms for small to medium messages. The multi-port RDMA-based Direct algorithms for the gather and all-to-all collectives achieve improvement factors of up to 2.15 for 4 KB messages over elan_gather() and up to 2.26 for 2 KB messages over elan_alltoall(), respectively. For the all-gather, our SMP-aware Bruck algorithm outperforms all other all-gather algorithms, including elan_gather(), for 512 B to 8 KB messages, with an improvement factor of 1.96 for 4 KB messages. Our multi-port Direct all-gather is the best algorithm for 16 KB to 1 MB messages and outperforms elan_gather() by a factor of 1.49 for 32 KB messages. Experiments with real applications show that a communication speedup of up to 1.47 can be achieved with the proposed all-gather algorithms.
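The SMP-aware all-gather above builds on the classic Bruck algorithm, which completes an all-gather among p processes in ceil(log2 p) communication rounds, doubling the amount of data exchanged each round, and works even when p is not a power of two. As a point of reference only (a minimal single-process simulation of the Bruck exchange pattern, not the paper's RDMA-based Elan implementation), the rounds can be sketched as:

```python
def bruck_allgather(blocks):
    """Simulate the Bruck all-gather on p logical processes.

    blocks[i] is the item contributed by process i.  Returns, for every
    process, the full ordered list [blocks[0], ..., blocks[p-1]].
    Completes in ceil(log2 p) rounds; in round k each process sends its
    accumulated buffer to rank (i - 2^k) mod p and receives from
    rank (i + 2^k) mod p.
    """
    p = len(blocks)
    # Each process starts with only its own block.
    buf = [[b] for b in blocks]
    step = 1
    while step < p:
        new_buf = [list(b) for b in buf]
        for i in range(p):
            # Process i receives the accumulated buffer of rank (i + step) mod p,
            # appending only as many blocks as it still needs.
            src = (i + step) % p
            need = p - len(buf[i])
            new_buf[i].extend(buf[src][:need])
        buf = new_buf
        step *= 2
    # After the log rounds, rank i holds blocks in order i, i+1, ..., i-1
    # (mod p); a local rotation restores the canonical order 0..p-1.
    return [buf[i][-i:] + buf[i][:-i] if i else buf[i] for i in range(p)]
```

Because the round count is logarithmic rather than linear in p, this pattern favors small messages, which is consistent with the paper's observation that the SMP-aware Bruck variant wins for 512 B to 8 KB messages while the Direct algorithm takes over for larger ones.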