Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

Authors:
Adam Moody; Greg Bronevetsky; Kathryn Mohror; Bronis R. de Supinski
Venue:
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2010


Abstract

High-performance computing (HPC) systems are growing more powerful by utilizing more hardware components. As the system mean-time-before-failure correspondingly drops, applications must checkpoint more frequently to make progress. However, as the system memory sizes grow faster than the bandwidth to the parallel file system, the cost of checkpointing begins to dominate application run times. Multi-level checkpointing potentially solves this problem through multiple types of checkpoints with different costs and different levels of resiliency in a single run. This solution employs lightweight checkpoints to handle the most common failure modes and relies on more expensive checkpoints for less common, but more severe, failures. This theoretically promising approach has not been fully evaluated in a large-scale production system context. We have designed the Scalable Checkpoint/Restart (SCR) library, a multi-level checkpoint system that writes checkpoints to RAM, Flash, or disk on the compute nodes in addition to the parallel file system. We present the performance and reliability properties of SCR as well as a probabilistic Markov model that predicts its performance on current and future systems. We show that multi-level checkpointing improves efficiency on existing large-scale systems and that this benefit increases as the system size grows. In particular, we developed low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of our system failures. This leads to a gain in machine efficiency of up to 35%, and it reduces the load on the parallel file system by a factor of two on current and future systems.
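
The relationship between checkpoint cost, failure rate, and lost machine time described above can be illustrated with a rough single-level estimate. The sketch below uses Young's well-known first-order approximation for the checkpoint interval, not the paper's Markov model, and it assumes the checkpoint cost is small relative to the mean time before failure. The function names and the numbers in the usage lines are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order optimum interval between checkpoints: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def approx_overhead(checkpoint_cost_s, mtbf_s):
    """Rough fraction of run time lost when checkpointing at Young's interval.

    Two first-order terms: time spent writing checkpoints (C / T) and the
    expected rework after a failure (T / (2 * M)). Restart time is ignored.
    """
    interval = young_interval(checkpoint_cost_s, mtbf_s)
    return checkpoint_cost_s / interval + interval / (2.0 * mtbf_s)


# Hypothetical numbers: a 50 s checkpoint versus one that is 100x cheaper,
# both on a system with a 10,000 s mean time before failure.
slow = approx_overhead(50.0, 10_000.0)
fast = approx_overhead(0.5, 10_000.0)
print(f"overhead with 50 s checkpoints:  {slow:.3f}")
print(f"overhead with 0.5 s checkpoints: {fast:.3f}")
```

Under these assumptions the overhead scales with the square root of the checkpoint cost, so a 100x cheaper checkpoint cuts the estimated overhead by about 10x. A multi-level scheme needs a richer model because cheap checkpoints recover from only some failure types.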