MPI Support for Multi-core Architectures: Optimized Shared Memory Collectives

  • Authors:
  • Richard L. Graham;Galen Shipman

  • Affiliations:
  • Oak Ridge National Laboratory, Oak Ridge, TN, USA;Oak Ridge National Laboratory, Oak Ridge, TN, USA

  • Venue:
  • Proceedings of the 15th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

With local core counts on the rise, taking advantage of shared-memory to optimize collective operations can improve performance. We study several on-host shared memory optimized algorithms for MPI_Bcast, MPI_Reduce, and MPI_Allreduce, using tree-based, and reduce-scatter algorithms. For small data operations with relatively large synchronization costs fan-in/fan-out algorithms generally perform best. For large messages data manipulation constitute the largest cost and reduce-scatter algorithms are best for reductions. These optimization improve performance by up to a factor of three. Memory and cache sharing effect require deliberate process layout and careful radix selection for tree-based methods.