Fault-tolerant parallel applications with dynamic parallel schedules: a programmer's perspective

Authors:
Sebastian Gerlach;Basile Schaeli;Roger D. Hersch
Affiliations:
Ecole Polytechnique Fédérale de Lausanne (EPFL), School of Computer and Communication Sciences, Ecublens, Switzerland;Ecole Polytechnique Fédérale de Lausanne (EPFL), School of Computer and Communication Sciences, Ecublens, Switzerland;Ecole Polytechnique Fédérale de Lausanne (EPFL), School of Computer and Communication Sciences, Ecublens, Switzerland
Venue:
Dependable Systems
Year:
2006

Abstract

Dynamic Parallel Schedules (DPS) is a flow-graph-based framework for developing parallel applications on clusters of workstations. The DPS flow graph execution model enables automatic pipelined parallel execution of applications. DPS supports graceful degradation of parallel applications when nodes fail. The fault-tolerance mechanism relies on a set of backup threads held in the volatile storage of alternate nodes. These backup threads are kept up to date by duplicating transmitted data objects and by performing periodic checkpointing. The current state of a failed node can be reconstructed on its backup threads by re-executing the application from the last checkpoint. A valid execution order is automatically deduced from the flow graph. Adding fault tolerance to a DPS application requires only minor changes to the application's source code. The present contribution focuses on the development of fault-tolerant parallel applications with DPS from a programmer's perspective.
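
The recovery scheme described in the abstract (duplicated data objects plus periodic checkpoints, followed by re-execution from the last checkpoint) can be illustrated with a minimal sketch. This is not the DPS API: DPS is a framework of its own, and the names `Worker`, `Backup`, `run`, `checkpoint_every`, and `fail_after` are hypothetical, chosen only for this illustration. The sketch also assumes a single stateful thread and uses the arrival order of the logged data objects as the replay order, whereas DPS deduces a valid order from the flow graph.

```python
import copy


class Worker:
    """Hypothetical stateful thread: accumulates the data objects it processes."""

    def __init__(self):
        self.state = {"total": 0, "count": 0}

    def process(self, data_object):
        self.state["total"] += data_object
        self.state["count"] += 1


class Backup:
    """Hypothetical backup thread held in the volatile memory of another node."""

    def __init__(self):
        self.checkpoint = {"total": 0, "count": 0}
        self.log = []  # data objects duplicated since the last checkpoint

    def duplicate(self, data_object):
        # Every data object sent to the worker is also sent to its backup.
        self.log.append(data_object)

    def store_checkpoint(self, state):
        # A checkpoint replaces the saved state; objects it already
        # reflects no longer need to be kept for replay.
        self.checkpoint = copy.deepcopy(state)
        self.log.clear()

    def recover(self):
        # Rebuild the lost state: restore the last checkpoint, then
        # re-execute the data objects received since that checkpoint.
        worker = Worker()
        worker.state = copy.deepcopy(self.checkpoint)
        for data_object in self.log:
            worker.process(data_object)
        return worker


def run(data_objects, checkpoint_every, fail_after=None):
    """Process all data objects, optionally losing the worker after `fail_after` objects."""
    worker, backup = Worker(), Backup()
    for i, data_object in enumerate(data_objects, start=1):
        backup.duplicate(data_object)
        worker.process(data_object)
        if i % checkpoint_every == 0:
            backup.store_checkpoint(worker.state)
        if i == fail_after:
            # Simulated node failure: the worker is lost and rebuilt from its backup.
            worker = backup.recover()
    return worker.state
```

Calling `run([1, 2, 3, 4, 5], checkpoint_every=2, fail_after=3)` yields the same final state as a run without failure, because the backup replays exactly the objects that arrived after the last checkpoint. A shorter checkpoint interval reduces the amount of re-execution on recovery at the cost of more frequent state copies.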