Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing

Authors:
Samir Jafar; Axel Krings; Thierry Gautier
Affiliations:
University of Damascus, Damascus; University of Idaho, Moscow; INRIA, (Projet MOASIS), LIG, St. Martin
Venue:
IEEE Transactions on Dependable and Secure Computing
Year:
2009


Abstract

Large applications executing on Grid or cluster architectures consisting of hundreds or thousands of computational nodes create problems with respect to reliability. The sources of these problems are node failures and the need for dynamic configuration over extensive run-time. This paper presents two fault-tolerance mechanisms called Theft Induced Checkpointing and Systematic Event Logging. These are transparent protocols capable of overcoming problems associated with both benign faults, i.e., crash faults, and node or subnet volatility. Specifically, the protocols base the state of the execution on a dataflow graph, allowing for efficient recovery in dynamic heterogeneous systems as well as multi-threaded applications. By allowing recovery even under different numbers of processors, the approaches are especially suitable for applications that need adaptive or reactionary configuration control. The low-cost protocols offer the capability of controlling or bounding the overhead. A formal cost model is presented, followed by an experimental evaluation. It is shown that the overhead of the protocols is very small and that the maximum work lost by a crashed process is small and bounded.
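
The abstract's central idea, that execution state expressed as a dataflow graph can be restored on a different number of processors, can be illustrated with a toy model. The sketch below is not the paper's Theft Induced Checkpointing or Systematic Event Logging protocol. It is a minimal illustration under stated assumptions: the task table `TASKS`, the helpers `ready_tasks`, `run`, `checkpoint` and `recover`, and the use of JSON as a platform-independent format are all hypothetical choices made for this example.

```python
import json

# Hypothetical dataflow graph: task name -> (function, names of input tasks).
TASKS = {
    "a": (lambda: 2, []),
    "b": (lambda: 3, []),
    "c": (lambda x, y: x + y, ["a", "b"]),
    "d": (lambda x: x * 10, ["c"]),
}


def ready_tasks(done):
    """Return tasks that have not run yet and whose inputs are all available."""
    return sorted(
        name for name, (_, deps) in TASKS.items()
        if name not in done and all(d in done for d in deps)
    )


def run(done, workers, max_steps=None):
    """Execute the graph in rounds of at most `workers` tasks per round.

    `done` maps finished task names to their results. In this toy model
    it is the entire execution state, independent of the worker count.
    """
    done = dict(done)
    steps = 0
    while len(done) < len(TASKS):
        if max_steps is not None and steps >= max_steps:
            break  # simulate stopping part-way through the computation
        results = {}
        for name in ready_tasks(done)[:workers]:
            fn, deps = TASKS[name]
            results[name] = fn(*(done[d] for d in deps))
        done.update(results)
        steps += 1
    return done


def checkpoint(done):
    """Serialize the state in a platform-independent form (JSON, by assumption)."""
    return json.dumps(done, sort_keys=True)


def recover(blob):
    """Rebuild the state from a checkpoint produced by `checkpoint`."""
    return json.loads(blob)


# One round on 2 workers, checkpoint, simulated crash, resume on 1 worker.
partial = run({}, workers=2, max_steps=1)
saved = checkpoint(partial)
final = run(recover(saved), workers=1)
```

Because the saved state records only which tasks have completed and what they produced, the resumed run does not depend on how many workers the original run used, and the work lost to the simulated crash is limited to whatever ran after the last checkpoint.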