In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and survey the fault tolerance approaches for these systems, along with the issues those approaches raise. We discuss rollback-recovery techniques in particular because they are the most widely used for long-running applications on HPC clusters. Specifically, we discuss the feature requirements of rollback-recovery and develop a taxonomy of over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain and to facilitate the development of new checkpointing solutions.