Design and modeling of a non-blocking checkpointing system

Authors:
Kento Sato;Naoya Maruyama;Kathryn Mohror;Adam Moody;Todd Gamblin;Bronis R. de Supinski;Satoshi Matsuoka
Affiliations:
Tokyo Institute of Technology, Ohokayama, Meguro-ku, Tokyo, Japan;Advanced Institute for Computational Science RIKEN, Minatojima-minami-machi, Chuo-ku, Kobe, Hyogo, Japan;Lawrence Livermore National Laboratory, Livermore, CA;Lawrence Livermore National Laboratory, Livermore, CA;Lawrence Livermore National Laboratory, Livermore, CA;Lawrence Livermore National Laboratory, Livermore, CA;Global Scientific Information and Computing Center, Tokyo Institute of Technology, Ohokayama, Meguro-ku, Tokyo, Japan
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012

Abstract

As the capability and component count of systems increase, the mean time between failures (MTBF) decreases. Applications typically tolerate failures by checkpointing to, and restarting from, a parallel file system (PFS). While simple, this approach can suffer from contention for PFS resources. Multi-level checkpointing is a promising solution. However, although multi-level checkpointing is successful on today's machines, it is not expected to be sufficient for exascale-class machines, which are predicted to have memory sizes and failure rates that are orders of magnitude larger. Our solution combines the benefits of non-blocking and multi-level checkpointing. In this paper, we present the design of our system and model its performance. Our experiments show that our system can improve efficiency by 1.1x to 2.0x on future machines. Additionally, applications using our checkpointing system can achieve high efficiency even when using a PFS with lower bandwidth.
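
To make the link between MTBF and efficiency concrete, the following is a minimal sketch that uses Young's well-known first-order approximation for the checkpoint interval, not the performance model developed in this paper. It assumes a single-level, blocking checkpoint with a fixed cost and ignores restart time; the function names and the example numbers are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimum interval between checkpoints, in seconds.

    Assumption: checkpoint cost is much smaller than the MTBF.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def estimated_efficiency(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order fraction of time spent on useful work.

    Waste is modeled as the checkpoint overhead per interval plus the
    expected rework after a failure (half an interval on average).
    Restart cost is ignored in this illustrative sketch.
    """
    interval = young_interval(checkpoint_cost_s, mtbf_s)
    waste = checkpoint_cost_s / interval + interval / (2.0 * mtbf_s)
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    # Hypothetical values: a 50 s blocking checkpoint at two failure rates.
    for mtbf in (10_000.0, 1_000.0):
        print(mtbf, young_interval(50.0, mtbf), estimated_efficiency(50.0, mtbf))
```

Under these assumptions, shrinking the MTBF while the checkpoint cost stays fixed lowers the estimated efficiency, which is the pressure that motivates reducing the cost the application sees per checkpoint.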