McrEngine: A scalable checkpointing system using data-aware aggregation and compression

  • Authors:
  • Tanzima Zerin Islam;Kathryn Mohror;Saurabh Bagchi;Adam Moody;Bronis R. de Supinski;Rudolf Eigenmann

  • Affiliations:
  • School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA. E-mails: {tislam, sbagchi, eigenman}@purdue.edu;Lawrence Livermore National Laboratory, Livermore, CA, USA. E-mails: {kathryn, moody20, bronis}@llnl.gov

  • Venue:
  • Scientific Programming - Selected Papers from Super Computing 2012
  • Year:
  • 2013

Abstract

High-performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves the compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
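
The difference between simple concatenation and data-aware aggregation can be illustrated with a minimal Python sketch. It is not mcrEngine's implementation: the dictionary of named byte buffers is only a stand-in for the variables that a library such as HDF5 or netCDF would expose, `zlib` is an arbitrary choice of compressor, and every function and variable name here is hypothetical. The sketch places same-named variables from all processes next to each other before compressing, and compares that layout with process-by-process concatenation.

```python
import struct
import zlib


def make_checkpoint(rank, n=2000):
    """Build a toy per-process checkpoint: variable name -> raw bytes.

    The variables and their contents are invented for illustration.
    """
    return {
        "temperature": struct.pack(f"<{n}d", *[300.0 + rank] * n),
        "step": struct.pack(f"<{n}i", *range(n)),
    }


def concat_then_compress(checkpoints):
    """Baseline: append whole checkpoints one process after another, then compress."""
    blob = b"".join(data for ckpt in checkpoints for data in ckpt.values())
    return zlib.compress(blob)


def data_aware_compress(checkpoints):
    """Group same-named variables across processes, then compress.

    Assumes every checkpoint holds the same set of variable names.
    """
    names = sorted(checkpoints[0])
    blob = b"".join(ckpt[name] for name in names for ckpt in checkpoints)
    return zlib.compress(blob)


# Usage: compare the compressed sizes of the two layouts for four processes.
checkpoints = [make_checkpoint(rank) for rank in range(4)]
print("concatenate, then compress:", len(concat_then_compress(checkpoints)), "bytes")
print("group by variable, then compress:", len(data_aware_compress(checkpoints)), "bytes")
```

Both functions compress exactly the same bytes and differ only in ordering, so any difference in output size comes from how well the compressor exploits similar data being adjacent. The size gap depends on the data and the compressor, and this toy example says nothing about the figures reported in the abstract.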