We give an overview of the algorithms and implementations of some of the most important collective operations of MPI (the Message Passing Interface) in the high-performance MPI libraries MPI/SX and MPI/ES. The infrastructure of MPI/SX makes it easy to incorporate new algorithms, as well as algorithms for common special cases (e.g., a single SX node, or a single MPI process per SX node). The libraries employ algorithms that are among the best known, and exploit special hardware features of the SX architecture and the Internode Crossbar Switch (IXS) wherever possible. We discuss in more detail the implementation of MPI_Barrier, MPI_Bcast, the MPI reduction collectives, MPI_Alltoall, and the gather/scatter collectives. Performance figures and comparisons to straightforward algorithms are given for a large SX-8 system and for the Earth Simulator. The measurements show excellent absolute performance and demonstrate the scalability of MPI/SX and MPI/ES to systems with large numbers of nodes.
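As an illustration of what a comparison against a "straightforward algorithm" can look like, the sketch below counts communication rounds for a broadcast in two textbook schemes: a linear broadcast, where the root sends to each rank in turn, and a binomial-tree broadcast. The abstract does not say which algorithms MPI/SX or MPI/ES actually use for MPI_Bcast, so the choice of a binomial tree is an assumption made only for illustration. The function names are hypothetical, and the code models send schedules in plain Python without calling any MPI library.

```python
def linear_bcast_schedule(p, root=0):
    """Straightforward broadcast: the root sends to one other rank per round."""
    return [[(root, r)] for r in range(p) if r != root]


def binomial_bcast_schedule(p, root=0):
    """Binomial-tree broadcast (assumed here for illustration only).

    In each round, every rank that already holds the data forwards it to
    one rank that does not, so the number of holders doubles per round.
    """
    rounds = []
    dist = 1
    while dist < p:
        sends = []
        for rel in range(dist):  # relative ranks 0..dist-1 already hold the data
            dst = rel + dist
            if dst < p:
                sends.append(((rel + root) % p, (dst + root) % p))
        rounds.append(sends)
        dist *= 2
    return rounds


def delivers_to_all(schedule, p, root=0):
    """Replay a schedule and check that every rank ends up with the data."""
    have = {root}
    for sends in schedule:
        received = set()
        for src, dst in sends:
            if src not in have:  # a rank cannot forward data it has not received
                return False
            received.add(dst)
        have |= received
    return have == set(range(p))


if __name__ == "__main__":
    for p in (8, 64, 512):
        lin = len(linear_bcast_schedule(p))
        tree = len(binomial_bcast_schedule(p))
        print(f"p={p}: linear rounds={lin}, binomial rounds={tree}")
```

In this simplified model the linear scheme needs p - 1 rounds while the binomial tree needs ceil(log2 p) rounds. Round counts ignore message size, bandwidth, and hardware features such as those the abstract mentions, so they indicate only the latency-term trend, not measured performance.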