When is multi-version checkpointing needed?

Authors:
Guoming Lu;Ziming Zheng;Andrew A. Chien
Affiliations:
University of Chicago, Chicago, IL, USA;University of Chicago, Chicago, IL, USA;Department of Computer Science, University of Chicago, IL, USA
Venue:
Proceedings of the 3rd Workshop on Fault-tolerance for HPC at extreme scale
Year:
2013

Abstract

The scaling of semiconductor technology and increasing power concerns, combined with system scale, make fault management a growing concern in high-performance computing systems. A greater variety of errors, higher error rates, longer detection intervals, and "silent" errors are all expected. Traditional checkpointing models and systems assume that error detection is nearly immediate and thus that preserving a single checkpoint is sufficient for resilience. We define a richer model for future systems that captures the reality of latent errors, i.e., errors that go undetected for some time, and use it to derive optimal checkpoint intervals for systems with latent errors. With that model, we explore the importance of multi-version checkpoint systems. Our results highlight the limits of single-checkpoint systems, showing that two to more than a dozen checkpoints may be needed to achieve acceptable error coverage. Further, to achieve reasonable system efficiency, multiple versions (two to seventeen) may be needed. We study several specific exascale machine scenarios, and the results show that two checkpoints are always beneficial and that, when checkpoint overheads are reduced, as many as three checkpoints are beneficial.
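
To make the notion of error coverage concrete, the following is a minimal sketch under stated assumptions, not the model derived in the paper. It assumes that an error strikes at a uniformly random point within a checkpoint interval, that detection latency is exponentially distributed with a known mean, and that the k most recent checkpoints are retained. Under those assumptions, recovery succeeds only if the error is detected before the last clean checkpoint is overwritten. The function names (young_interval, coverage, versions_needed) and all parameter values are hypothetical. Young's first-order interval, sqrt(2 * C * M), is included only as a familiar baseline that assumes immediate detection.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order optimum interval sqrt(2*C*M); assumes immediate detection."""
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def coverage(versions: int, interval: float, mean_latency: float) -> float:
    """Probability that at least one of the `versions` retained checkpoints is clean.

    Assumed model (illustrative only): the error occurs at offset u ~ Uniform(0, T)
    after the last clean checkpoint, and is detected after a delay
    D ~ Exponential(mean = mean_latency). The clean checkpoint is still retained
    iff u + D < versions * T, which integrates to the closed form below.
    """
    if versions < 1:
        raise ValueError("versions must be at least 1")
    if interval <= 0 or mean_latency <= 0:
        raise ValueError("interval and mean_latency must be positive")
    ratio = interval / mean_latency
    tail = math.exp(-(versions - 1) * ratio) - math.exp(-versions * ratio)
    return 1.0 - tail / ratio


def versions_needed(interval: float, mean_latency: float, target: float,
                    max_versions: int = 1000) -> int:
    """Smallest number of retained checkpoints whose coverage reaches `target`."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    for k in range(1, max_versions + 1):
        if coverage(k, interval, mean_latency) >= target:
            return k
    raise ValueError("target coverage not reached within max_versions")


if __name__ == "__main__":
    # Hypothetical numbers: 5-minute checkpoint cost, 24-hour MTBF (in minutes).
    t = young_interval(checkpoint_cost=5.0, mtbf=24.0 * 60.0)
    for latency in (0.1 * t, t, 5.0 * t):
        k = versions_needed(t, latency, target=0.99)
        print(f"interval={t:.1f} min, mean latency={latency:.1f} min -> keep {k} versions")
```

In this sketch, a single retained checkpoint gives high coverage only when the mean detection latency is small relative to the checkpoint interval. As latency grows, the number of versions needed to reach a fixed coverage target grows with it, which is the qualitative effect the abstract describes.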