Optimizing MPI one-sided communication on multi-core InfiniBand clusters using shared memory backed windows

  • Authors:
  • Sreeram Potluri;Hao Wang;Vijay Dhanraj;Sayantan Sur;Dhabaleswar K. Panda

  • Affiliations:
Department of Computer Science and Engineering, The Ohio State University (all authors)

  • Venue:
  • EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
  • Year:
  • 2011


Abstract

The Message Passing Interface (MPI) has been very popular for programming parallel scientific applications. As multi-core architectures have become prevalent, a major question that has emerged is how MPI should be used within a compute node and what impact this has on communication costs. The one-sided communication interface in MPI provides a mechanism to reduce communication costs by removing the matching requirements of the send/receive model. The MPI standard provides the flexibility to allocate memory windows backed by shared memory. However, state-of-the-art open-source MPI libraries do not leverage this optimization opportunity on commodity clusters. In this paper, we present a design and implementation of the intra-node MPI one-sided interface using shared memory backed windows on multi-core clusters. We use the MVAPICH2 MPI library for our design, implementation and evaluation. Micro-benchmark evaluation shows that the new design improves Put, Get and Accumulate latencies by up to 85% with passive synchronization. The bandwidth of Put and Get improves by 64% and 42%, respectively. The SPLASH LU benchmark shows an improvement of up to 55% with the new design on a 32-core Magny-Cours node, and a similar improvement on a 12-core Westmere node. The mean BFS time in Graph500 is reduced by 39% and 77% on the Magny-Cours and Westmere nodes, respectively.
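
To make the one-sided interface concrete, the following is a minimal sketch (not taken from the paper) of an intra-node MPI Put under passive-target synchronization. The window buffer is requested through MPI_Alloc_mem, which gives an MPI library such as MVAPICH2 the opportunity to back the window with shared memory; the window size of one integer per rank and the choice of rank 0 as the target are illustrative assumptions, not details from the paper.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    int *win_buf;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Ask MPI for the window memory so the library is free to place it
       in shared memory for ranks on the same node. */
    MPI_Alloc_mem(nprocs * sizeof(int), MPI_INFO_NULL, &win_buf);
    for (int i = 0; i < nprocs; i++)
        win_buf[i] = -1;

    MPI_Win_create(win_buf, nprocs * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Passive-target epoch: the target (rank 0) makes no matching call. */
    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    MPI_Put(&rank, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
    MPI_Win_unlock(0, win);

    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) {
        /* Access the local window under a lock for portability. */
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        for (int i = 0; i < nprocs; i++)
            printf("slot %d = %d\n", i, win_buf[i]);
        MPI_Win_unlock(0, win);
    }

    MPI_Win_free(&win);
    MPI_Free_mem(win_buf);
    MPI_Finalize();
    return 0;
}

With passive synchronization the target rank makes no matching call during the epoch; each origin rank completes its Put independently, which is the synchronization mode used for the latency results quoted above.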