Collective operations in NEC's high-performance MPI libraries

Authors:
Hubert Ritzdorf;Jesper Larsson Träff
Affiliations:
C&C Research Laboratories, NEC Europe Ltd., Sankt Augustin, Germany;C&C Research Laboratories, NEC Europe Ltd., Sankt Augustin, Germany
Venue:
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Year:
2006

Citing 17
Cited 6

Algorithms for scalable synchronization on shared-memory multiprocessors

ACM Transactions on Computer Systems (TOCS)
On global combine operations

Journal of Parallel and Distributed Computing
A high-performance, portable implementation of the MPI message passing interface standard

Parallel Computing
Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems

IEEE Transactions on Parallel and Distributed Systems
The implementation of MPI-2 one-sided communication for the NEC SX-5

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
MPI-The Complete Reference, Volume 1: The MPI Core

MPI-The Complete Reference, Volume 1: The MPI Core
The Hierarchical Factor Algorithm for All-to-All Communication (Research Note)

Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
SKaMPI: A Detailed, Accurate MPI Benchmark

Proceedings of the 5th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Reproducible Measurements of MPI Performance Characteristics

Proceedings of the 6th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Improved MPI All-to-all Communication on a Giganet SMP Cluster

Proceedings of the 9th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Fast Collective Operations Using Shared and Remote Memory Access Protocols on Clusters

IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
Performance Analysis of MPI Collective Operations

IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 15 - Volume 16
Optimization of MPI collective communication on BlueGene/L systems

Proceedings of the 19th annual international conference on Supercomputing
On optimizing collective communication

CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
An optimal broadcast algorithm adapted to SMP clusters

PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
An improved algorithm for (non-commutative) reduce-scatter with an application

PVM/MPI'05 Proceedings of the 12th European PVM/MPI users' group conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Optimal broadcast for fully connected networks

HPCC'05 Proceedings of the First international conference on High Performance Computing and Communications

Optimal broadcast for fully connected processor-node networks

Journal of Parallel and Distributed Computing
Efficient shared memory and RDMA based collectives on multi-rail QsNetII SMP clusters

Cluster Computing
Two-tree algorithms for full bandwidth broadcast, reduction and scan

Parallel Computing
Efficient allgather for regular SMP-Clusters

EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
Open issues in MPI implementation

ACSAC'07 Proceedings of the 12th Asia-Pacific conference on Advances in Computer Systems Architecture
Full bandwidth broadcast, reduction and scan with only two trees

PVM/MPI'07 Proceedings of the 14th European conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface

Quantified Score

Hi-index	0.00

Visualization

Abstract

We give an overview of the algorithms and implementations in the high-performance MPI libraries MPI/SX and MPI/ES of some of the most important collective operations of MPI (the Message Passing Interface). The infrastructure of MPI/SX makes it easy to incorporate new algorithms and algorithms for common special cases (e.g. a single SX node, or a single MPI process per SX node). Algorithms that are among the best known are employed, and special hardware features of the SX architecture and Internode Crossbar Switch (IXS) are exploited wherever possible. We discuss in more detail the implementation of MPI_Barrier, MPI_Bcast, the MPI reduction collectives, MPI_Alltoall, and the gather/scatter collectives. Performance figures and comparisons to straightforward algorithms are given for a large SX-8 system, and for the Earth Simulator. The measurements show excellent absolute performance, and demonstrate the scalability of MPI/SX and MPI/ES to systems with large numbers of nodes.