Commodity computer clusters are often composed of hundreds of computing nodes. These systems are generally built from off-the-shelf components and are not designed for high reliability. Node failures therefore reduce the mean time between failures (MTBF) of such clusters to unacceptably low levels. The software frameworks used for running parallel applications need to be fault-tolerant in order to ensure continued execution despite node failures. We propose an extension to the flow-graph-based Dynamic Parallel Schedules (DPS) development framework that allows non-trivial parallel applications to continue executing despite node failures. The proposed fault-tolerance mechanism relies on a set of backup threads located in the volatile storage of alternate nodes. These backup threads are kept up to date by duplicating the transmitted data objects and by periodically checkpointing thread states. In case of a failure, the current state of the threads that were on the failed node is reconstructed on the backup threads by re-executing operations. The corresponding valid re-execution order is automatically deduced from the data flow graph of the DPS application. Multiple simultaneous failures can be tolerated, provided that, for each thread, either the active thread or its corresponding backup thread survives. For threads that do not store a local state, an optimized mechanism eliminates the need for duplicate data object transmissions. The overhead induced by the fault-tolerance mechanism consists mainly of duplicate data object transmissions that can, for compute-bound applications, be carried out in parallel with ongoing computations. The increase in execution time due to fault tolerance therefore remains relatively low. It depends on the communication-to-computation ratio and on the parallel program's efficiency.
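The following C++ fragment is a minimal, hypothetical sketch of the three steps described above: duplicating each transmitted data object to a backup thread held on an alternate node, periodically checkpointing the thread state, and re-executing the logged operations after a failure. The types and function names (DataObject, BackupThread, deliver, checkpoint, recover) are assumptions made for this illustration and are not part of the actual DPS API; in DPS the valid re-execution order is deduced from the application's flow graph, which is simplified here to plain sequence order.

// Minimal illustrative sketch -- not the actual DPS API.
// All names below (DataObject, BackupThread, deliver, checkpoint, recover)
// are hypothetical and exist only to illustrate the mechanism.
#include <cstddef>
#include <iostream>
#include <map>
#include <utility>
#include <vector>

// A serialized data object exchanged between flow-graph threads.
struct DataObject {
    std::size_t sequence;       // position in the receiving thread's input stream
    std::vector<char> payload;  // serialized contents
};

// Backup image of a thread, held in the volatile storage of an alternate node.
struct BackupThread {
    std::vector<char> lastCheckpoint;       // last checkpointed thread state
    std::size_t checkpointSequence = 0;     // last data object covered by that checkpoint
    std::map<std::size_t, DataObject> log;  // data objects received since then
};

// Deliver a data object to the active thread and duplicate it to the backup,
// so that the backup can re-execute the corresponding operation after a failure.
void deliver(BackupThread& backup, const DataObject& obj) {
    // ... transmission to the active thread on its own node omitted ...
    backup.log[obj.sequence] = obj;         // duplicate transmission
}

// Periodic checkpoint: store the thread state on the backup and discard the
// duplicated data objects that the checkpoint already covers.
void checkpoint(BackupThread& backup, std::vector<char> state, std::size_t upTo) {
    backup.lastCheckpoint = std::move(state);
    backup.checkpointSequence = upTo;
    backup.log.erase(backup.log.begin(), backup.log.upper_bound(upTo));
}

// After the active node fails: restore the last checkpoint on the backup and
// re-execute the logged operations. In DPS the valid order is deduced from the
// data flow graph; plain sequence order stands in for it here.
void recover(const BackupThread& backup) {
    // ... restoring the thread state from backup.lastCheckpoint omitted ...
    for (const auto& entry : backup.log) {
        std::cout << "re-executing operation for data object " << entry.first << "\n";
    }
}

int main() {
    BackupThread backup;
    deliver(backup, DataObject{1, {'a'}});
    deliver(backup, DataObject{2, {'b'}});
    checkpoint(backup, {'s'}, 1);           // checkpoint now covers data object 1
    deliver(backup, DataObject{3, {'c'}});
    recover(backup);                        // re-executes operations 2 and 3
    return 0;
}

The erase call in checkpoint illustrates why the duplication overhead stays bounded: once a checkpoint covers a data object, its copy on the backup can be discarded, so the backup only ever holds the state of the last checkpoint plus the data objects received since then.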