Can software reliability outperform hardware reliability on high performance interconnects?: a case study with MPI over InfiniBand

  • Authors: Matthew J. Koop; Rahul Kumar; Dhabaleswar K. Panda
  • Affiliation: The Ohio State University, Columbus, OH, USA
  • Venue: Proceedings of the 22nd Annual International Conference on Supercomputing
  • Year: 2008

Abstract

An important part of modern supercomputing platforms is the network interconnect. As the number of computing nodes in clusters has increased, the role of the interconnect has become more important. Modern interconnects such as InfiniBand, Quadrics, and Myrinet have become popular due to their lower latency and higher performance compared to traditional Ethernet. As these interconnects become more widely used and clusters continue to scale, design choices, such as where data reliability should be provided, become increasingly important. In this work we address the issue of network reliability design using InfiniBand as a case study. Unlike other high-performance interconnects, InfiniBand exposes both reliable and unreliable APIs. As part of our study we implement the Message Passing Interface (MPI) over the Unreliable Connection (UC) transport and compare it with MPI over the Reliable Connection (RC) and Unreliable Datagram (UD) transports. We detail the costs of reliability for different message patterns and show that providing reliability in software instead of hardware can increase performance by up to 25% in a molecular dynamics application (NAMD) on a 512-core InfiniBand cluster.
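
The abstract's central point is that InfiniBand, unlike most high-performance interconnects, exposes both reliable and unreliable transports through the same verbs API. The sketch below is illustrative and not taken from the paper: it shows how a libibverbs application selects among the RC, UC, and UD transports simply by setting the queue pair type at creation time. Queue sizes and the lack of error handling are simplifying assumptions; in a design like the one the paper studies, an MPI library running over UC would additionally layer its own acknowledgment and retransmission logic in software, which is not shown here.

```c
/* Illustrative sketch (not from the paper): choosing an InfiniBand
 * transport with libibverbs. IBV_QPT_RC provides reliability in the
 * HCA hardware, while IBV_QPT_UC and IBV_QPT_UD leave reliability
 * (and, for UD, message segmentation) to software such as MPI. */
#include <stdio.h>
#include <infiniband/verbs.h>

static struct ibv_qp *create_qp(struct ibv_pd *pd, struct ibv_cq *cq,
                                enum ibv_qp_type type)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap     = { .max_send_wr  = 128, .max_recv_wr  = 128,
                     .max_send_sge = 1,   .max_recv_sge = 1 },
        .qp_type = type,   /* IBV_QPT_RC, IBV_QPT_UC, or IBV_QPT_UD */
    };
    return ibv_create_qp(pd, &attr);
}

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no InfiniBand device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 256, NULL, NULL, 0);

    /* Same verbs API in all three cases; only the transport type differs. */
    struct ibv_qp *rc = create_qp(pd, cq, IBV_QPT_RC); /* hardware reliability  */
    struct ibv_qp *uc = create_qp(pd, cq, IBV_QPT_UC); /* connected, no retransmit */
    struct ibv_qp *ud = create_qp(pd, cq, IBV_QPT_UD); /* datagram, MTU-limited    */

    printf("created QPs: RC=%p UC=%p UD=%p\n", (void *)rc, (void *)uc, (void *)ud);

    /* Resource cleanup omitted for brevity. */
    ibv_free_device_list(devs);
    return 0;
}
```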