This paper presents a new checkpoint/recovery method for dataflow computations that use work-stealing in heterogeneous environments, as found in grid or cluster computing. The state of the computation is represented by a dynamic macro dataflow graph, and it is shown that mechanisms built on this representation provide effective checkpointing for multithreaded applications in heterogeneous environments. Two methods are presented, Systematic Event Logging and Theft-Induced Checkpointing. Both are efficient and extremely flexible under the system-state model, allowing recovery on different platforms and with a different number of processors. A formal analysis of the overhead induced by both methods is presented, followed by an experimental evaluation on a large cluster. It is shown that both methods have very small overhead and that the trade-off between checkpointing cost and recovery cost can be controlled.