McrEngine: a scalable checkpointing system using data-aware aggregation and compression

Authors:
Tanzima Zerin Islam; Kathryn Mohror; Saurabh Bagchi; Adam Moody; Bronis R. de Supinski; Rudolf Eigenmann
Affiliations:
Purdue University, West Lafayette, IN; Lawrence Livermore National Laboratory (LLNL), Livermore, CA; Purdue University, West Lafayette, IN; Lawrence Livermore National Laboratory (LLNL), Livermore, CA; Lawrence Livermore National Laboratory (LLNL), Livermore, CA; Purdue University, West Lafayette, IN
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012


Abstract

High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. mcrEngine aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
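
The contrast the abstract draws between data-aware aggregation and simple concatenation can be illustrated with a small sketch. This is not mcrEngine's implementation: the checkpoint contents, the variable names (`temperature`, `cell_id`), the helper functions, and the use of zlib are all assumptions made for illustration, and a plain Python dict stands in for the variable metadata that a library such as HDF5 or netCDF would expose. The sketch builds toy checkpoints for four processes, lays them out either process by process or variable by variable, compresses both layouts, and checks that the variable-wise layout can be split back into the original per-process checkpoints.

```python
import struct
import zlib


def make_checkpoint(rank, n=256):
    """Build a toy checkpoint: a mapping from variable name to raw bytes.

    The variable names and value patterns are invented for illustration.
    """
    return {
        "temperature": struct.pack(
            f"<{n}d", *(300.0 + 0.5 * (i % 8) for i in range(n))
        ),
        "cell_id": struct.pack(f"<{n}i", *(rank * n + i for i in range(n))),
    }


def concat_per_process(checkpoints):
    """Baseline layout: each process's whole checkpoint, one after another."""
    return b"".join(
        b"".join(ckpt[name] for name in sorted(ckpt)) for ckpt in checkpoints
    )


def merge_by_variable(checkpoints):
    """Data-aware layout: the same variable from all processes side by side."""
    names = sorted(checkpoints[0])
    return b"".join(
        b"".join(ckpt[name] for ckpt in checkpoints) for name in names
    )


def split_by_variable(blob, layout, n_procs):
    """Invert merge_by_variable given (name, per-process byte size) pairs."""
    checkpoints = [{} for _ in range(n_procs)]
    offset = 0
    for name, size in layout:
        for ckpt in checkpoints:
            ckpt[name] = blob[offset:offset + size]
            offset += size
    return checkpoints


checkpoints = [make_checkpoint(rank) for rank in range(4)]

# Compress both layouts with the same general-purpose compressor.
baseline = zlib.compress(concat_per_process(checkpoints))
merged = zlib.compress(merge_by_variable(checkpoints))
print("per-process layout:", len(baseline), "bytes compressed")
print("per-variable layout:", len(merged), "bytes compressed")

# Restart path: decompress and recover each process's checkpoint.
layout = [(name, len(checkpoints[0][name])) for name in sorted(checkpoints[0])]
restored = split_by_variable(zlib.decompress(merged), layout, len(checkpoints))
print("round trip ok:", restored == checkpoints)
```

Whether the per-variable layout compresses better depends on the data and the compressor; the sketch only shows the mechanics of the two layouts and makes no claim about reproducing the figures reported in the abstract.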