LogP: towards a realistic model of parallel computation
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
The communication challenge for MPP: Intel Paragon and Meiko CS-2
Parallel Computing
Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures
Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems
IEEE Transactions on Parallel and Distributed Systems
Optimization of MPI collectives on clusters of large-scale SMP's
SC '99 Proceedings of the 1999 ACM/IEEE conference on Supercomputing
Automatically tuned collective communications
Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Multiphase Complete Exchange on Paragon, SP2, and CS-2
IEEE Parallel & Distributed Technology: Systems & Technology
Message passing and shared address space parallelism on an SMP cluster
Parallel Computing
Fast Collective Operations Using Shared and Remote Memory Access Protocols on Clusters
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Efficient and Scalable All-to-All Personalized Exchange for InfiniBand-Based Clusters
ICPP '04 Proceedings of the 2004 International Conference on Parallel Processing
Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Performance Analysis of MPI Collective Operations
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 15 - Volume 16
LiMIC: Support for High-Performance MPI Intra-node Communication on Linux Cluster
ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
Optimizing Collective Communications on SMP Clusters
ICPP '05 Proceedings of the 2005 International Conference on Parallel Processing
Optimizing All-to-All Collective Communication by Exploiting Concurrency in Modern Networks
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Data Transfers between Processes in an SMP System: Performance Study and Application to MPI
ICPP '06 Proceedings of the 2006 International Conference on Parallel Processing
HOTI '07 Proceedings of the 15th Annual IEEE Symposium on High-Performance Interconnects
RDMA-based and SMP-aware Multi-port All-Gather on Multi-rail QsNet^II SMP Clusters
ICPP '07 Proceedings of the 2007 International Conference on Parallel Processing
Collective operations in NEC's high-performance MPI libraries
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Efficient allgather for regular SMP-Clusters
EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
Efficient shared memory and RDMA based design for MPI_Allgather over InfiniBand
EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
High performance RDMA based all-to-all broadcast for InfiniBand clusters
HiPC'05 Proceedings of the 12th international conference on High Performance Computing
Optimised gather collectives on QsNetII
PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Process Arrival Pattern and Shared Memory Aware Alltoall on InfiniBand
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Clusters of Symmetric Multiprocessors (SMPs) are more commonplace than ever in high-performance computing, and the scientific applications that run on them make extensive use of collective communications. Shared-memory communication and Remote Direct Memory Access (RDMA) over multi-rail networks are promising approaches to meeting the growing demand for intra-node and inter-node bandwidth, and thereby to boosting the performance of collectives on emerging multi-core SMP clusters. To this end, this paper designs and evaluates two classes of collective communication algorithms directly at the Elan user level over multi-rail Quadrics QsNetII with message striping: 1) RDMA-based traditional multi-port algorithms for the gather, all-gather, and all-to-all collectives for medium to large messages, and 2) RDMA-based and SMP-aware multi-port all-gather algorithms for small to medium messages. The multi-port RDMA-based Direct algorithms for the gather and all-to-all collectives achieve improvement factors of up to 2.15 for 4 KB messages over elan_gather() and up to 2.26 for 2 KB messages over elan_alltoall(), respectively. For the all-gather, our SMP-aware Bruck algorithm outperforms all other all-gather algorithms, including elan_gather(), for 512 B to 8 KB messages, with an improvement factor of 1.96 for 4 KB messages. Our multi-port Direct all-gather is the best algorithm for 16 KB to 1 MB messages and outperforms elan_gather() by a factor of 1.49 for 32 KB messages. Experiments with real applications show that a communication speedup of up to 1.47 can be achieved with the proposed all-gather algorithms.
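The SMP-aware all-gather above builds on the classic Bruck algorithm, which completes an all-gather among p processes in ceil(log2 p) communication rounds, doubling the amount of data exchanged each round, and works even when p is not a power of two. As a point of reference only (a minimal single-process simulation of the Bruck exchange pattern, not the paper's RDMA-based Elan implementation), the rounds can be sketched as:

```python
def bruck_allgather(blocks):
    """Simulate the Bruck all-gather on p logical processes.

    blocks[i] is the item contributed by process i.  Returns, for every
    process, the full ordered list [blocks[0], ..., blocks[p-1]].
    Completes in ceil(log2 p) rounds; in round k each process sends its
    accumulated buffer to rank (i - 2^k) mod p and receives from
    rank (i + 2^k) mod p.
    """
    p = len(blocks)
    # Each process starts with only its own block.
    buf = [[b] for b in blocks]
    step = 1
    while step < p:
        new_buf = [list(b) for b in buf]
        for i in range(p):
            # Process i receives the accumulated buffer of rank (i + step) mod p,
            # appending only as many blocks as it still needs.
            src = (i + step) % p
            need = p - len(buf[i])
            new_buf[i].extend(buf[src][:need])
        buf = new_buf
        step *= 2
    # After the log rounds, rank i holds blocks in order i, i+1, ..., i-1
    # (mod p); a local rotation restores the canonical order 0..p-1.
    return [buf[i][-i:] + buf[i][:-i] if i else buf[i] for i in range(p)]
```

Because the round count is logarithmic rather than linear in p, this pattern favors small messages, which is consistent with the paper's observation that the SMP-aware Bruck variant wins for 512 B to 8 KB messages while the Direct algorithm takes over for larger ones.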