A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

Authors:
Ifeanyi P. Egwutuoha;David Levy;Bran Selic;Shiping Chen
Affiliations:
School of Electrical & Information Engineering, The University of Sydney, Sydney, Australia 2006;School of Electrical & Information Engineering, The University of Sydney, Sydney, Australia 2006;School of Electrical & Information Engineering, The University of Sydney, Sydney, Australia 2006;Information Engineering Laboratory, CSIRO ICT Centre, Sydney, Australia
Venue:
The Journal of Supercomputing
Year:
2013

Abstract

In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems, survey the fault tolerance approaches for these systems, and discuss the issues with these approaches. We focus on rollback-recovery techniques because they are the ones most often used for long-running applications on HPC clusters. Specifically, we discuss the feature requirements of rollback-recovery and develop a taxonomy of over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain and to facilitate the development of new checkpointing solutions.