MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware

  • Authors:
  • Rajanikanth Batchu; Yoginder S. Dandass; Anthony Skjellum; Murali Beddhu

  • Affiliations:
  • Mississippi State University, Department of Computer Science, Box 9637, Mississippi State, MS 39762, USA; Mississippi State University, Department of Computer Science, Box 9637, Mississippi State, MS 39762, USA; Verari Systems Software, Inc., 110 12th Street North, Suite D103, Birmingham, AL 35203, USA; The University of Southern Mississippi, Department of Computer Science and Statistics, 118 College Drive, Box 5106, Hattiesburg, MS 39406, USA

  • Venue:
  • Cluster Computing
  • Year:
  • 2004

Abstract

Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This notion has also been extended to message-passing systems with user-transparent process checkpointing and message logging. Furthermore, studies of multiple types of rollback and recovery have been reported in the literature, ranging from communication-induced checkpointing to pessimistic and synchronous solutions. However, many of these solutions incur high overhead because they cannot utilize application-level information.

This paper describes the design and implementation of MPI/FT, a high-performance MPI-1.2 implementation enhanced with low-overhead functionality to detect and recover from process failures. The strategy behind MPI/FT is that fault tolerance in message-passing middleware can be optimized based on an application's execution model, which is derived from its communication topology and parallel programming semantics. MPI/FT exploits the specific characteristics of two parallel application execution models in order to optimize performance. MPI/FT also introduces the self-checking thread, which monitors the functioning of the middleware itself. User-aware checkpointing and user-assisted recovery are compatible with MPI/FT and complement the techniques used here.

This paper offers a classification of MPI applications for fault-tolerant MPI purposes, and the MPI/FT implementation discussed here provides different middleware versions specifically tailored to each of the two models studied in detail. The interplay of various parameters affecting the cost of fault tolerance is investigated. Experimental results demonstrate that the approach used to design and implement MPI/FT yields a low-overhead, MPI-based, fault-tolerant communication middleware implementation.