As the capability and component count of systems increase, the mean time between failures (MTBF) decreases. Applications typically tolerate failures by checkpointing to, and restarting from, a parallel file system (PFS). While simple, this approach can suffer from contention for PFS resources. Multi-level checkpointing is a promising solution. However, although multi-level checkpointing is successful on today's machines, it is not expected to be sufficient for exascale-class machines, which are predicted to have memory sizes and failure rates that are orders of magnitude larger. Our solution combines the benefits of non-blocking and multi-level checkpointing. In this paper, we present the design of our system and model its performance. Our experiments show that our system can improve efficiency by 1.1x to 2.0x on future machines. Additionally, applications using our checkpointing system can achieve high efficiency even when using a PFS with lower bandwidth.
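To illustrate why a shrinking MTBF and a slow PFS both reduce efficiency, the following is a minimal sketch of a generic first-order model for blocking checkpoint/restart. It is not the performance model developed in the paper. It assumes Young's first-order approximation for the checkpoint interval, ignores restart time, and treats every checkpoint as a blocking write of fixed cost. The function names and the parameter values are hypothetical, chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the compute time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def approx_efficiency(checkpoint_cost_s, mtbf_s):
    """Rough fraction of wall-clock time spent on useful work.

    Assumed first-order waste terms (restart time is ignored):
      checkpoint_cost / interval  -> time spent writing checkpoints
      interval / (2 * mtbf)       -> expected recomputation after a failure
    """
    interval = young_interval(checkpoint_cost_s, mtbf_s)
    waste = checkpoint_cost_s / interval + interval / (2.0 * mtbf_s)
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    # Hypothetical values: a 10-minute blocking PFS checkpoint versus a
    # 1-minute checkpoint, at an MTBF of one day and of one hour.
    for cost_s in (600.0, 60.0):
        for mtbf_s in (86400.0, 3600.0):
            eff = approx_efficiency(cost_s, mtbf_s)
            print(f"cost={cost_s:.0f}s mtbf={mtbf_s:.0f}s efficiency={eff:.3f}")
```

Under these assumptions, efficiency falls as the MTBF shrinks and rises as the effective checkpoint cost drops, which is the trade-off that cheaper or non-blocking checkpoint levels aim to improve.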