Adaptive Checkpoint Replication for Supporting the Fault Tolerance of Applications in the Grid

  • Authors:
  • Andre Luckow; Bettina Schnor

  • Venue:
  • NCA '08 Proceedings of the 2008 Seventh IEEE International Symposium on Network Computing and Applications
  • Year:
  • 2008

Abstract

A major challenge in a dynamic Grid with thousands of machines connected to each other is fault tolerance. The more resources and components are involved, the more complicated and error-prone the system becomes. Migol is an adaptive Grid middleware that addresses the fault tolerance of Grid applications and services by providing the capability to recover applications from checkpoint files automatically. A critical aspect of automatic recovery is the availability of checkpoint files: if a resource becomes unavailable, it is very likely that the associated storage is also unreachable, e.g., due to a network partition. A strategy to increase the availability of checkpoints is replication. In this paper, we present the Checkpoint Replication Service. A key feature of this service is the ability to automatically replicate and monitor checkpoints in the Grid.