Redesigning MPI shared memory communication for large multi-core architecture

  • Authors:
  • Miao Luo, Hao Wang, Jerome Vienne, Dhabaleswar K. Panda

  • Affiliations:
  • Department of Computer Science and Engineering, The Ohio State University, Columbus, USA (all authors)

  • Venue:
  • Computer Science - Research and Development

  • Year:
  • 2013

Abstract

Modern multi-core platforms are evolving rapidly, with 32/64 cores per node. Sharing of system resources can increase communication efficiency between processes on the same node; however, it also increases contention for those resources. Most current MPI libraries are designed for systems with a relatively small number of cores per node. On emerging multi-core systems with hundreds of cores per node, the existing shared memory mechanisms in MPI runtimes suffer from scalability problems, which may limit the benefits gained from multi-core systems. In this paper, we first analyze these problems and then propose a set of new schemes for small- and large-message transfer over shared memory. The "Shared Tail Cyclic Buffer" scheme reduces the number of read and write operations on shared control structures. The "State-Driven Polling" scheme optimizes message polling by dynamically adjusting the polling frequency for different communication pairs. Through dynamic distribution of runtime pinned-down memory, the "On-Demand Global Shared Memory Pool" brings the benefits of pair-wise buffers to large-message transfer and optimizes shared send buffer utilization without increasing total shared memory usage. In micro-benchmark evaluation, the new schemes deliver up to 26 % and 70 % improvement in point-to-point latency and bandwidth, respectively. For applications, the new schemes achieve an 18 % improvement for the Graph500 benchmark on a 64-core/node Bulldozer system and up to 11 % improvement for the NAS benchmarks. In a 512-process evaluation on the 32-core/node Trestles system, the new schemes achieve a 16 % improvement for the NAS CG benchmark.
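
The abstract only names these schemes; as a rough illustration of the general idea behind a shared-tail cyclic buffer, the following C fragment sketches a single-producer/single-consumer ring in which each side caches the other side's index and reads or writes the shared control words only when necessary. All names, sizes, and the batching policy are illustrative assumptions, not the paper's actual design.

  /* Hypothetical sketch only, not the paper's implementation.
   * One cyclic buffer per communication pair, placed in a shared-memory
   * segment.  The producer caches the shared tail and the consumer keeps a
   * private tail that it publishes in batches, so accesses to the shared
   * control words (and the cache lines holding them) are kept to a minimum. */
  #include <stdatomic.h>
  #include <string.h>

  #define SLOT_SIZE 64          /* payload bytes per slot (assumed)         */
  #define NUM_SLOTS 128         /* power of two, so index wrap-around works */

  typedef struct {
      _Atomic unsigned head;    /* written by producer, read by consumer    */
      _Atomic unsigned tail;    /* written by consumer, read by producer    */
      char slots[NUM_SLOTS][SLOT_SIZE];
  } cyclic_buf_t;

  /* Producer: re-reads the shared tail only when the cached copy suggests
   * the ring is full.  Caller guarantees len <= SLOT_SIZE and initializes
   * *cached_tail to 0. */
  int cb_send(cyclic_buf_t *cb, unsigned *cached_tail,
              const void *msg, size_t len)
  {
      unsigned head = atomic_load_explicit(&cb->head, memory_order_relaxed);
      if (head - *cached_tail == NUM_SLOTS) {
          *cached_tail = atomic_load_explicit(&cb->tail, memory_order_acquire);
          if (head - *cached_tail == NUM_SLOTS)
              return 0;                        /* really full; caller retries */
      }
      memcpy(cb->slots[head % NUM_SLOTS], msg, len);
      atomic_store_explicit(&cb->head, head + 1, memory_order_release);
      return 1;
  }

  /* Consumer: advances a private tail and publishes it to the shared control
   * word only once per batch, or when the ring has been drained. */
  int cb_recv(cyclic_buf_t *cb, unsigned *priv_tail, void *msg, size_t len)
  {
      unsigned head = atomic_load_explicit(&cb->head, memory_order_acquire);
      if (*priv_tail == head)
          return 0;                            /* nothing to poll             */
      memcpy(msg, cb->slots[*priv_tail % NUM_SLOTS], len);
      (*priv_tail)++;
      if ((*priv_tail & 15) == 0 || *priv_tail == head)
          atomic_store_explicit(&cb->tail, *priv_tail, memory_order_release);
      return 1;
  }

Batching the tail update is what cuts the number of writes to the shared control structure in this sketch; per the abstract, the paper's "Shared Tail Cyclic Buffer" scheme targets the same kind of reduction in reads and writes on shared control structures, while "State-Driven Polling" and the "On-Demand Global Shared Memory Pool" address polling cost and shared-buffer usage, respectively.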