A survey and review of the current state of rollback-recovery for cluster systems

  • Authors:
  • Andrew Maloney; Andrzej Goscinski

  • Affiliations:
  • School of Information Technology, Deakin University, Australia; School of Information Technology, Deakin University, Australia

  • Venue:
  • Concurrency and Computation: Practice & Experience
  • Year:
  • 2009

Abstract

A variety of research problems exist that require considerable time and computational resources to solve. Attempting to solve these problems produces long-running applications that require a reliable and trustworthy system on which to execute. Cluster systems provide an excellent environment for running these applications because of their low cost-to-performance ratio; however, because they are built from commodity components, they are prone to failures. This report surveyed and reviewed the issues currently relating to providing fault tolerance for long-running applications. Several fault tolerance approaches were investigated, and rollback-recovery was found to provide a favourable approach for user applications in cluster systems. Two facilities are required to provide fault tolerance using rollback-recovery: checkpointing and recovery. It was shown here that a multitude of work has been done on enhancing checkpointing; however, the intricacies of providing recovery have been neglected. The problems associated with providing recovery include providing transparent and autonomic recovery, selecting appropriate recovery computers, and maintaining a consistent observable behaviour when an application fails. Copyright © 2009 John Wiley & Sons, Ltd.