Cooperative checkpointing: a robust approach to large-scale systems reliability

  • Authors:
  • Adam J. Oliner;Larry Rudolph;Ramendra K. Sahoo

  • Affiliations:
  • Stanford University, Palo Alto, CA;MIT, CSAIL, Cambridge, MA;IBM, T. J. Watson Research Center, Hawthorne, NY

  • Venue:
  • Proceedings of the 20th annual international conference on Supercomputing
  • Year:
  • 2006

Quantified Score

Hi-index 0.03

Visualization

Abstract

Cooperative checkpointing increases the performance and robustness of a system by allowing checkpoints requested by applications to be dynamically skipped at runtime. A robust system must be more than merely resilient to failures; it must be adaptable and flexible in the face of new and evolving challenges. A simulation-based experimental analysis using both probabilistic and harvested failure distributions reveals that cooperative checkpointing enables an application to make progress under a wide variety of failure distributions that periodic checkpointing lacks the flexibility to handle. Cooperative checkpointing can be easily implemented on top of existing application-initiated checkpointing mechanisms and may be used to enhance other reliability techniques like QoS guarantees and fault-aware job scheduling. The simulations also support a number of theoretical predictions related to cooperative checkpointing, including the non-competitiveness of periodic checkpointing.