Analyzing fault aware collective performance in a process fault tolerant MPI

  • Authors: Joshua Hursey; Richard L. Graham
  • Affiliations: Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA (both authors)
  • Venue: Parallel Computing
  • Year: 2012

Abstract

Application developers are investigating Algorithm Based Fault Tolerance (ABFT) techniques to improve the efficiency of application recovery beyond what traditional techniques alone can provide. To keep using High Performance Computing (HPC) systems efficiently in the presence of process failure, applications will depend on libraries that sustain failure-free performance levels across process failures. Optimized Message Passing Interface (MPI) collective operations are a critical component of many scalable HPC applications, yet most collective algorithms cannot handle process failure. Next generation MPI implementations must therefore provide fault aware versions of these algorithms that sustain performance across process failure. This paper discusses the design and implementation of fault aware collective algorithms for tree structured communication patterns. Three design approaches (rerouting, lookup avoiding, and rebalancing) are described and analyzed for their performance impact relative to similar fault unaware barrier and broadcast collective algorithms. The analysis shows that the rerouting approach causes a significant performance degradation, while the rebalancing approach can bring performance to within 1% of the fault unaware performance. This paper also presents the impact of the run-through stabilization prototype on point-to-point communication, and analyzes the time to rebalance the tree while accounting for process failures.
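
To make the contrast between rerouting and rebalancing concrete, the following is a minimal sketch in Python, not the paper's implementation. It assumes a static binary tree over ranks 0..size-1 rooted at rank 0 (assumed alive), and it assumes that "rerouting" means skipping over failed processes in the original tree while "rebalancing" means rebuilding the tree over the surviving ranks only. The abstract does not define these terms in detail, so all function names and the tree shape here are hypothetical. The lookup avoiding approach is not modeled.

```python
def reroute_parent(rank, failed):
    """Nearest alive ancestor of `rank` in the static binary tree.

    Assumption: rank 0 is the root and has not failed.
    """
    parent = (rank - 1) // 2 if rank > 0 else None
    while parent is not None and parent in failed:
        parent = (parent - 1) // 2 if parent > 0 else None
    return parent


def reroute_children(rank, size, failed):
    """Alive descendants that `rank` must serve directly.

    A failed child is skipped and its children are adopted instead,
    so the fan-out of `rank` can grow with each failure.
    """
    adopted = []
    stack = [2 * rank + 2, 2 * rank + 1]  # left child is popped first
    while stack:
        child = stack.pop()
        if child >= size:
            continue
        if child in failed:
            stack.extend([2 * child + 2, 2 * child + 1])
        else:
            adopted.append(child)
    return adopted


def rebalanced_tree(size, failed):
    """Rebuild a binary tree over the alive ranks only.

    Returns {rank: (parent, children)}; fan-out stays at most two.
    """
    alive = [r for r in range(size) if r not in failed]
    tree = {}
    for i, r in enumerate(alive):
        parent = alive[(i - 1) // 2] if i > 0 else None
        children = [alive[j] for j in (2 * i + 1, 2 * i + 2) if j < len(alive)]
        tree[r] = (parent, children)
    return tree
```

With seven ranks and rank 1 failed, `reroute_children(0, 7, {1})` gives the root three direct children, whereas `rebalanced_tree(7, {1})` keeps every fan-out at two or fewer. Under the assumptions of this sketch, that growth in fan-out is one plausible source of the degradation the abstract attributes to rerouting.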