Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids

Authors:
Maria Chtepen; Filip H. A. Claeys; Bart Dhoedt; Filip De Turck; Piet Demeester; Peter A. Vanrolleghem
Affiliations:
Ghent University - IBBT, Gent; CTO MOSTforWATER N.V., Belgium; Ghent University - IBBT, Gent; Ghent University - IBBT, Gent; Ghent University - IBBT, Gent; Université Laval, Québec
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
2009

Abstract

A grid is a distributed computational and storage environment, often composed of heterogeneous, autonomously managed subsystems. As a result, resource availability varies, which often causes executing jobs to be lost or delayed. To ensure good grid performance, fault tolerance must be taken into account. Periodic job checkpointing and replication are commonly used techniques for providing fault tolerance in distributed systems. Although both are very robust, they can delay job execution if inappropriate checkpointing intervals or replica numbers are chosen. This paper introduces several heuristics that dynamically adapt these parameters, based on information about grid status, to provide high job throughput in the presence of failures while reducing system overhead. Furthermore, a novel fault-tolerant algorithm combining checkpointing and replication is presented. The proposed methods are evaluated in a newly developed grid simulation environment, Dynamic Scheduling in Distributed Environments (DSiDE), which allows for easy modeling of dynamic system and job behavior. Simulations are run with workload and system parameters derived from logs collected from several large-scale parallel production systems. The experiments show that adaptive approaches can considerably improve system performance, while the preferred solution depends on particular system characteristics such as load, job submission patterns, and failure frequency.
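
The abstract does not spell out the heuristics themselves, so the following is only an illustrative sketch of what adapting a checkpointing interval and a replica count to grid status might look like. It is not the paper's method. The interval rule is Young's well-known first-order approximation, sqrt(2 * checkpoint cost * mean time between failures), swapped in as a stand-in; the replication rule, the `GridStatus` fields, and all function names and default values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class GridStatus:
    """Hypothetical snapshot of grid state (all fields are assumptions)."""
    mean_time_between_failures: float  # seconds, estimated from recent history
    checkpoint_cost: float             # seconds needed to write one checkpoint
    idle_resources: int                # resources currently without work


def checkpoint_interval(status: GridStatus, min_interval: float = 60.0) -> float:
    """Pick a checkpointing interval from the current failure estimate.

    Uses Young's approximation, not a heuristic from the paper. Frequent
    failures shorten the interval; costly checkpoints lengthen it.
    """
    interval = math.sqrt(
        2.0 * status.checkpoint_cost * status.mean_time_between_failures
    )
    # Never checkpoint more often than min_interval (illustrative floor).
    return max(min_interval, interval)


def replica_count(status: GridStatus, queued_jobs: int, max_replicas: int = 3) -> int:
    """Pick how many copies of each job to run (hypothetical rule).

    Spare resources are shared evenly among queued jobs, so replication
    only happens when the grid is lightly loaded.
    """
    if queued_jobs <= 0:
        return 1
    copies_per_job = status.idle_resources // queued_jobs
    return max(1, min(max_replicas, copies_per_job))


status = GridStatus(mean_time_between_failures=7200.0,
                    checkpoint_cost=25.0,
                    idle_resources=10)
interval_seconds = checkpoint_interval(status)
copies_light_load = replica_count(status, queued_jobs=4)
copies_heavy_load = replica_count(status, queued_jobs=20)
```

Under these assumed inputs the sketch checkpoints every 600 seconds, runs two copies of each job when 4 jobs are queued, and falls back to a single copy when 20 jobs are queued. This mirrors the trade-off described in the abstract: a poorly chosen interval or replica number adds overhead, so both are recomputed as the observed grid status changes.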