A survey and review of the current state of rollback-recovery for cluster systems

Authors:
Andrew Maloney;Andrzej Goscinski
Affiliations:
School of Information Technology, Deakin University, Australia;School of Information Technology, Deakin University, Australia
Venue:
Concurrency and Computation: Practice & Experience
Year:
2009

Abstract

Many research problems require considerable time and computational resources to solve. Attempting to solve them produces long-running applications that need a reliable and trustworthy system on which to execute. Cluster systems provide an excellent environment for running these applications because of their low cost-to-performance ratio; however, because they are built from commodity components, they are prone to failures. This report surveys and reviews the issues currently involved in providing fault tolerance for long-running applications. Several fault tolerance approaches were investigated, and rollback-recovery was found to be a favourable approach for user applications in cluster systems. Two facilities are required to provide fault tolerance using rollback-recovery: checkpointing and recovery. The survey shows that a great deal of work has been done on enhancing checkpointing, whereas the intricacies of providing recovery have been neglected. The problems associated with providing recovery include: providing transparent and autonomic recovery, selecting appropriate recovery computers, and maintaining consistent observable behaviour when an application fails. Copyright © 2009 John Wiley & Sons, Ltd.