Providing Fault-Tolerance in Unreliable Grid Systems Through Adaptive Checkpointing and Replication

Authors:
Maria Chtepen;Filip H. Claeys;Bart Dhoedt;Filip Turck;Peter A. Vanrolleghem;Piet Demeester
Affiliations:
Department of Information Technology (INTEC), Ghent University, Sint-Pietersnieuwstraat 41, Ghent, Belgium;Department of Applied Mathematics, Biometrics and Process Control, (BIOMATH), Ghent University, Coupure Links 653, Ghent, Belgium;Department of Information Technology (INTEC), Ghent University, Sint-Pietersnieuwstraat 41, Ghent, Belgium;Department of Information Technology (INTEC), Ghent University, Sint-Pietersnieuwstraat 41, Ghent, Belgium;Department of Applied Mathematics, Biometrics and Process Control, (BIOMATH), Ghent University, Coupure Links 653, Ghent, Belgium;Department of Information Technology (INTEC), Ghent University, Sint-Pietersnieuwstraat 41, Ghent, Belgium
Venue:
ICCS '07 Proceedings of the 7th international conference on Computational Science, Part I: ICCS 2007
Year:
2007

Abstract

Because grids typically consist of autonomously managed subsystems with strongly varying resources, fault-tolerance is an important aspect of application scheduling. Two well-known techniques for providing fault-tolerance in grids are periodic task checkpointing and replication. Both techniques reduce the amount of work lost when system availability changes, but both can introduce significant run-time overhead. This overhead depends largely on the length of the checkpointing interval and on the chosen number of replicas, respectively. This paper presents a dynamic scheduling algorithm that switches between periodic checkpointing and replication to exploit the advantages of both techniques and to reduce the overhead. Furthermore, several novel heuristics are discussed that perform on-line adaptive tuning of the checkpointing period based on historical information on resource behavior. A simulation-based comparison of the proposed combined algorithm with traditional strategies based on checkpointing only or replication only suggests a significant reduction in average task makespan for systems with varying load.
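
The dependence of checkpointing overhead on the interval length, and the general idea of tuning that interval from observed resource behavior, can be illustrated with a minimal sketch. This is not the paper's algorithm or any of its heuristics: the function names, the cost model (a fixed cost per checkpoint, no failures during the run), and the doubling/halving rule with its bounds are all assumptions chosen only for illustration.

```python
import math


def checkpoint_overhead(work, interval, cost):
    """Failure-free time spent writing checkpoints for a task of length `work`.

    Assumed model (not from the paper): one checkpoint every `interval`
    time units, each costing `cost`, and none at the very end of the task.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    n_checkpoints = max(math.ceil(work / interval) - 1, 0)
    return n_checkpoints * cost


def adapt_interval(interval, failed, grow=2.0, shrink=0.5, lo=1.0, hi=3600.0):
    """Hypothetical tuning rule: shorten the interval after a failure was
    observed on the resource, lengthen it otherwise, clamped to [lo, hi]."""
    factor = shrink if failed else grow
    return min(max(interval * factor, lo), hi)
```

Under these assumptions, a task of length 100 with an interval of 10 and a per-checkpoint cost of 2 spends 18 time units on checkpoints, while an interval longer than the task spends none but risks losing all work on a failure. Replication shifts the cost from time to resources, since each additional replica occupies another resource for the duration of the task.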