libhashckpt: hash-based incremental checkpointing using GPU's

  • Authors:
  • Kurt B. Ferreira;Rolf Riesen;Ron Brightwell;Patrick Bridges;Dorian Arnold

  • Affiliations:
  • Scalable System Software, Sandia National Laboratories;IBM Research, Ireland;Scalable System Software, Sandia National Laboratories;Department of Computer Science, University of New Mexico;Department of Computer Science, University of New Mexico

  • Venue:
  • EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
  • Year:
  • 2011

Abstract

Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we introduce libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads, we show the merit of this technique for a certain class of HPC applications.
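
The hashing half of the idea described in the abstract can be illustrated with a minimal CPU-only sketch. This is not libhashckpt's implementation or API: the function names, the 4096-byte block size, and the choice of SHA-256 are all assumptions made for illustration, and neither the page-protection step nor the GPU offload is shown. The sketch hashes fixed-size blocks of application state and saves only the blocks whose hash differs from the previous checkpoint.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size (one typical memory page); hypothetical


def block_hashes(data, block_size=BLOCK_SIZE):
    """Return one SHA-256 digest per fixed-size block of `data`."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def incremental_checkpoint(data, prev_hashes, block_size=BLOCK_SIZE):
    """Return (changed_blocks, new_hashes).

    `changed_blocks` maps a block index to a copy of that block's bytes,
    for every block whose hash differs from `prev_hashes` or that has no
    previous hash at all (as in the first, full checkpoint).
    """
    new_hashes = block_hashes(data, block_size)
    changed = {}
    for idx, digest in enumerate(new_hashes):
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest:
            start = idx * block_size
            changed[idx] = bytes(data[start:start + block_size])
    return changed, new_hashes


def restore(base, changed, block_size=BLOCK_SIZE):
    """Apply saved blocks on top of `base` (assumes the state size is fixed)."""
    buf = bytearray(base)
    for idx, block in changed.items():
        start = idx * block_size
        buf[start:start + len(block)] = block
    return bytes(buf)


# Usage: a full checkpoint first, then an incremental one after a small write.
state = bytearray(4 * BLOCK_SIZE)
full, hashes = incremental_checkpoint(state, [])
state[5000] = 1  # this byte falls in block 1
delta, hashes = incremental_checkpoint(state, hashes)
```

In this sketch a single modified byte causes only the block containing it to be saved, which is the property that makes hashing attractive when an application rewrites memory with unchanged values or touches only a small part of a page.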