MPI/FT: A Model-Based Approach to Low-Overhead Fault Tolerant Message-Passing Middleware

Authors:
Rajanikanth Batchu;Yoginder S. Dandass;Anthony Skjellum;Murali Beddhu
Affiliations:
Mississippi State University, Department of Computer Science, Box 9637, Mississippi State, MS 39762, USA;Mississippi State University, Department of Computer Science, Box 9637, Mississippi State, MS 39762, USA;Verari Systems Software, Inc., 110 12th Street North, Suite D103, Birmingham, AL 35203, USA;The University of Southern Mississippi, Department of Computer Science and Statistics, 118 College Drive, Box 5106, Hattiesburg, MS 39406, USA
Venue:
Cluster Computing
Year:
2004

Abstract

Fault tolerance in parallel systems has traditionally been achieved through a combination of redundancy and checkpointing methods. This notion has also been extended to message-passing systems with user-transparent process checkpointing and message logging. Furthermore, studies of multiple types of rollback and recovery have been reported in the literature, ranging from communication-induced checkpointing to pessimistic and synchronous solutions. However, many of these solutions incur high overhead because of their inability to utilize application-level information.

This paper describes the design and implementation of MPI/FT, a high-performance MPI-1.2 implementation enhanced with low-overhead functionality to detect and recover from process failures. The strategy behind MPI/FT is that fault tolerance in message-passing middleware can be optimized based on an application's execution model derived from its communication topology and parallel programming semantics. MPI/FT exploits the specific characteristics of two parallel application execution models in order to optimize performance. MPI/FT also introduces the self-checking thread that monitors the functioning of the middleware itself. User-aware checkpointing and user-assisted recovery are compatible with MPI/FT and complement the techniques used here.

This paper offers a classification of MPI applications for fault-tolerant MPI purposes, and the MPI/FT implementation discussed here provides different middleware versions specifically tailored to each of the two models studied in detail. The interplay of various parameters affecting the cost of fault tolerance is investigated. Experimental results demonstrate that the approach used to design and implement MPI/FT results in a low-overhead MPI-based fault-tolerant communication middleware implementation.
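
The abstract does not say how process failures are detected or what the self-checking thread inspects. As a rough illustration only, the sketch below shows a generic heartbeat-timeout check of the kind a monitoring thread might run periodically. It is not MPI/FT's actual mechanism: the class `HeartbeatMonitor`, its methods, the component names, and the 2-second timeout are all assumptions made for this example, and time is passed in explicitly so the behaviour is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper: records the last heartbeat time per component."""

    timeout: float  # seconds of silence after which a component is suspected
    last_beat: Dict[str, float] = field(default_factory=dict)

    def beat(self, component: str, now: float) -> None:
        # A monitored component reports that it is still making progress.
        self.last_beat[component] = now

    def suspects(self, now: float) -> List[str]:
        # Components silent for longer than the timeout, sorted for stable output.
        return sorted(
            name for name, t in self.last_beat.items() if now - t > self.timeout
        )


# Illustrative component names; neither is taken from the paper.
monitor = HeartbeatMonitor(timeout=2.0)
monitor.beat("progress-engine", now=0.0)
monitor.beat("rank-1", now=0.0)
monitor.beat("progress-engine", now=1.5)

# At t = 3.0 only "rank-1" has been silent for more than 2.0 seconds.
print(monitor.suspects(now=3.0))
```

In a real middleware the check would run on its own thread against a wall clock, and a suspected component would trigger whatever recovery the application's execution model calls for; the sketch covers only the detection step.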