High-performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their state in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads because processes contend for PFS resources. These overheads force large-scale applications to checkpoint less often, which means more compute time is lost when a failure occurs. We alleviate this problem with a scalable checkpoint-restart system, mcrEngine. mcrEngine aggregates checkpoints from multiple application processes, using knowledge of the data semantics available through widely used I/O libraries such as HDF5 and netCDF, and compresses them. Our novel scheme improves the compressibility of checkpoints by up to 115% over simple concatenation followed by compression. Our evaluation with large-scale application checkpoints shows that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% relative to a baseline with no aggregation or compression.
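To illustrate the difference between simple concatenation and data-aware aggregation, here is a minimal Python sketch. It is not mcrEngine's implementation. It assumes each process checkpoint can be modeled as a dictionary mapping variable names to raw bytes (a stand-in for the variable metadata that HDF5 or netCDF would expose), and it uses `zlib` as a generic compressor. The function names, variable names, and sample data are hypothetical.

```python
import zlib


def concat_then_compress(checkpoints):
    """Baseline: append each process's whole checkpoint, then compress."""
    blob = b"".join(data for ckpt in checkpoints for data in ckpt.values())
    return zlib.compress(blob)


def group_then_compress(checkpoints):
    """Data-aware merge: place same-named variables from all processes
    next to each other before compressing (assumed grouping rule)."""
    names = sorted({name for ckpt in checkpoints for name in ckpt})
    blob = b"".join(
        ckpt[name] for name in names for ckpt in checkpoints if name in ckpt
    )
    return zlib.compress(blob)


# Hypothetical checkpoints from four processes, two variables each.
checkpoints = [
    {"temperature": bytes([rank]) * 4096, "mesh_ids": bytes(range(256)) * 16}
    for rank in range(4)
]

baseline = concat_then_compress(checkpoints)
merged = group_then_compress(checkpoints)
print("concatenated:", len(baseline), "bytes; grouped:", len(merged), "bytes")
```

Both functions compress the same bytes in a different order. Whether grouping helps, and by how much, depends on the data and the compressor. A real system would also need to record variable offsets so that each process can recover its own data on restart, which this sketch omits.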