Checkpointing with rollback recovery is a very effective technique for tolerating failures. Checkpoint data is usually saved to disk files, but in some situations the disk operations incur considerable performance overhead. An alternative is to keep the checkpoint data in main memory. This paper presents two main-memory checkpointing schemes that can be used on any parallel machine without requiring any change to the hardware: one scheme saves the checkpoints in the memory of other processors, while the other is based on a parity approach. Both techniques have been implemented and evaluated on a commercial parallel machine. The results clearly show the superiority of one of the two schemes.
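As a rough illustration of the two ideas named in the abstract, the following Python sketch models checkpoints as byte buffers. It is not the paper's implementation: the function names (`mirror_placement`, `parity_of`, `recover`), the ring-style neighbor placement, and the use of a single XOR parity block are all assumptions made for the example, and all checkpoints are assumed to have equal length.

```python
from functools import reduce
from typing import Dict, List


def mirror_placement(n: int) -> Dict[int, int]:
    """Hypothetical placement for the 'memory of other processors' idea:
    processor i keeps a copy of its checkpoint in the memory of (i + 1) % n."""
    return {i: (i + 1) % n for i in range(n)}


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def parity_of(checkpoints: List[bytes]) -> bytes:
    """Parity idea (assumed XOR): one parity block covers all checkpoints."""
    return reduce(xor_bytes, checkpoints)


def recover(surviving: List[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity)


# Three processors, each with a 4-byte checkpoint (illustrative data only).
checkpoints = [b"aaaa", b"bbbb", b"cccc"]
parity = parity_of(checkpoints)

# Processor 1 fails; its checkpoint is rebuilt from the other two plus parity.
rebuilt = recover([checkpoints[0], checkpoints[2]], parity)
```

Under these assumptions, mirroring stores one full extra copy per processor, while the parity variant stores a single extra block for the whole group but can rebuild only one lost checkpoint at a time.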