A checkpoint/recovery model for heterogeneous dataflow computations using work-stealing

Authors:
Samir Jafar; Thierry Gautier; Axel Krings; Jean-Louis Roch
Affiliations:
Laboratoire ID – IMAG, Pre-project MOAIS (CNRS-INRIA, INPG-UJF), Montbonnot St. Martin, France; Laboratoire ID – IMAG, Pre-project MOAIS (CNRS-INRIA, INPG-UJF), Montbonnot St. Martin, France; Computer Science Dept, University of Idaho, Moscow, ID; Laboratoire ID – IMAG, Pre-project MOAIS (CNRS-INRIA, INPG-UJF), Montbonnot St. Martin, France
Venue:
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Year:
2005

Abstract

This paper presents a new checkpoint/recovery method for dataflow computations that use work-stealing in heterogeneous environments, such as those found in grid or cluster computing. By basing the state of the computation on a dynamic macro dataflow graph, the proposed mechanisms are shown to provide effective checkpointing for multithreaded applications in such environments. Two methods are presented: Systematic Event Logging and Theft-Induced Checkpointing. Both are efficient and extremely flexible under the system-state model, allowing recovery on a different platform and with a different number of processors. A formal analysis of the overhead induced by both methods is followed by an experimental evaluation on a large cluster. The results show that both methods incur very small overhead and that the trade-off between checkpointing cost and recovery cost can be controlled.