When is multi-version checkpointing needed?

  • Authors:
  • Guoming Lu; Ziming Zheng; Andrew A. Chien

  • Affiliations:
  • University of Chicago, Chicago, IL, USA; University of Chicago, Chicago, IL, USA; Department of Computer Science, University of Chicago, Chicago, IL, USA

  • Venue:
  • Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
  • Year:
  • 2013


Abstract

The scaling of semiconductor technology and increasing power concerns, combined with system scale, make fault management a growing concern in high performance computing systems. A greater variety of errors, higher error rates, longer detection intervals, and "silent" errors are all expected. Traditional checkpointing models and systems assume that error detection is nearly immediate, and thus that preserving a single checkpoint is sufficient for resilience. We define a richer model for future systems that captures the reality of latent errors, i.e., errors that go undetected for some time, and use it to derive optimal checkpoint intervals for systems with latent errors. With that model, we explore the importance of multi-version checkpoint systems. Our results highlight the limits of single-checkpoint systems, showing that two to more than a dozen checkpoints may be needed to achieve acceptable error coverage, and that two to seventeen versions may be needed to achieve reasonable system efficiency. We study several specific exascale machine scenarios; the results show that retaining two checkpoints is always beneficial, and that when checkpoint overheads are reduced, as many as three checkpoints are beneficial.
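The paper's richer latent-error model is not reproduced in this listing. As a point of reference, the sketch below uses the classic Young approximation for the optimal checkpoint interval (which assumes immediate error detection, i.e., the single-checkpoint setting the paper argues is insufficient), plus a simple back-of-the-envelope heuristic, not the paper's derivation, for how many checkpoint versions are needed to roll back past an error that stays latent for a given window. All numbers are illustrative, not results from the paper.

```python
import math

def young_interval(checkpoint_cost, mtbf):
    """Classic Young approximation for the optimal checkpoint
    interval: tau = sqrt(2 * delta * M), where delta is the cost
    of writing one checkpoint and M is the mean time between
    failures. Assumes errors are detected immediately."""
    return math.sqrt(2 * checkpoint_cost * mtbf)

def versions_needed(detection_latency, interval):
    """Illustrative heuristic (not the paper's model): to recover
    from an error that may stay latent for up to
    `detection_latency`, retain enough checkpoints taken at
    `interval` spacing to span that window, plus one older copy
    guaranteed to predate the error."""
    return math.ceil(detection_latency / interval) + 1

# Hypothetical exascale-style parameters, in seconds:
# 10-minute checkpoint cost, 1-day system MTBF.
tau = young_interval(600, 86400)
print(round(tau))                    # ~10182 s between checkpoints

# With a 4-hour latent-error detection window, a single
# checkpoint is clearly not enough:
print(versions_needed(4 * 3600, tau))
```

Under these assumed parameters the heuristic already calls for three retained versions, which is consistent in spirit with the abstract's finding that single-checkpoint systems fall short once detection latency is accounted for.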