A log-scaling fault tolerant agreement algorithm for a fault tolerant MPI

  • Authors:
  • Joshua Hursey;Thomas Naughton;Geoffroy Vallee;Richard L. Graham

  • Affiliations:
  • Oak Ridge National Laboratory, Oak Ridge, TN;Oak Ridge National Laboratory, Oak Ridge, TN;Oak Ridge National Laboratory, Oak Ridge, TN;Oak Ridge National Laboratory, Oak Ridge, TN

  • Venue:
  • EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
  • Year:
  • 2011

Abstract

The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI standard does not provide standardized fault tolerance interfaces and semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages. Error handling mechanisms are described that preserve the fault tolerance properties while maintaining overall scalability.
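
The idea of folding a reduction into a tree-based two-phase commit can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the authors' implementation: the binary tree rooted at rank 0, the function names (`children`, `gather_votes`, `broadcast_decision`, `agree`, `tree_depth`), and the bitwise AND used as the reduction are all hypothetical choices, and process failures and the paper's error handling are not modeled at all. It shows only that the votes can be combined on the way up the tree and the result sent back down, so that the reduction rides along with the messages the two phases already exchange.

```python
def children(rank, size):
    """Children of `rank` in a binary tree rooted at rank 0 (assumed layout)."""
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < size]


def gather_votes(rank, votes, combine):
    """Phase 1: each node combines its own vote with its subtree's votes."""
    value = votes[rank]
    for child in children(rank, len(votes)):
        value = combine(value, gather_votes(child, votes, combine))
    return value


def broadcast_decision(rank, size, decision, out):
    """Phase 2: the root's decision travels back down the same tree."""
    out[rank] = decision
    for child in children(rank, size):
        broadcast_decision(child, size, decision, out)


def agree(votes, combine=lambda a, b: a & b):
    """Return the value every rank ends up with (failure-free case only)."""
    size = len(votes)  # assumes at least one rank
    decision = gather_votes(0, votes, combine)
    out = [None] * size
    broadcast_decision(0, size, decision, out)
    return out


def tree_depth(size):
    """Number of tree levels below the root; grows logarithmically with size."""
    return size.bit_length() - 1


if __name__ == "__main__":
    print(agree([1, 1, 0, 1]))
    print(tree_depth(1024))
```

In this simulation each phase crosses each tree edge once, so the number of sequential steps is proportional to the tree depth rather than to the number of ranks.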