NUMA-aware shared-memory collective communication for MPI
Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing
With local core counts on the rise, taking advantage of shared memory to optimize collective operations can improve performance. We study several on-host shared-memory-optimized algorithms for MPI_Bcast, MPI_Reduce, and MPI_Allreduce, using tree-based and reduce-scatter algorithms. For small-data operations, where synchronization costs are relatively large, fan-in/fan-out algorithms generally perform best. For large messages, data manipulation constitutes the largest cost, and reduce-scatter algorithms are best for reductions. These optimizations improve performance by up to a factor of three. Memory and cache sharing effects require deliberate process layout and careful radix selection for tree-based methods.
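As a concrete illustration (a minimal sketch, not the authors' implementation), the C code below performs a flat fan-in reduction within a node using MPI-3 shared-memory windows (MPI_Win_allocate_shared): every on-node rank exposes its buffer in one shared window, and the root sums the segments directly instead of exchanging messages. The buffer size N, the node-local communicator split, and the flat reduction loop are illustrative assumptions.

/* Sketch: intra-node fan-in reduction over a shared-memory window.
 * Assumes MPI-3; N is an illustrative buffer size. */
#include <mpi.h>
#include <stdio.h>

#define N 4  /* elements per rank; illustrative only */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Group the ranks that share this node's memory. */
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);
    int rank, size;
    MPI_Comm_rank(node, &rank);
    MPI_Comm_size(node, &size);

    /* Each rank contributes N doubles to one shared window. */
    double *mine;
    MPI_Win win;
    MPI_Win_allocate_shared(N * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, node, &mine, &win);
    for (int i = 0; i < N; i++) mine[i] = rank + 1.0;

    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
    MPI_Barrier(node);               /* fan-in: all inputs are ready */

    if (rank == 0) {                 /* root reduces in place over all segments */
        for (int r = 1; r < size; r++) {
            MPI_Aint sz; int disp;
            double *theirs;
            MPI_Win_shared_query(win, r, &sz, &disp, &theirs);
            for (int i = 0; i < N; i++) mine[i] += theirs[i];
        }
        printf("reduced[0] = %g (expect %g)\n", mine[0],
               size * (size + 1) / 2.0);
    }

    MPI_Barrier(node);               /* fan-out: result now visible; other
                                        ranks could read rank 0's segment
                                        via MPI_Win_shared_query */
    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}

A tree-based variant would replace the root's flat loop with log-radix combining rounds, which is where the radix-selection and process-layout issues noted in the abstract come into play.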