User level failure mitigation in MPI

  • Authors:
  • Wesley Bland

  • Affiliations:
  • Innovative Computing Laboratory, University of Tennessee

  • Venue:
  • Euro-Par'12 Proceedings of the 18th international conference on Parallel processing workshops
  • Year:
  • 2012

Abstract

In a constant effort to deliver steady performance improvements, the size of High Performance Computing (HPC) systems, as observed in the Top 500 ranking, has grown tremendously over the last decade. This trend, along with the resulting decrease in the Mean Time Between Failures (MTBF), is unlikely to stop; consequently, many computing nodes will inevitably fail during application execution [5]. It is alarming that most popular fault-tolerant approaches see their efficiency plummet at Exascale [3,4], calling for more efficient approaches revolving around application-centric failure mitigation strategies [7].