An Experimental Study about Diskless Checkpointing

Authors:
Luís M. Silva; João Gabriel Silva
Venue:
EUROMICRO '98 Proceedings of the 24th Conference on EUROMICRO - Volume 1
Year:
1998

Abstract

Checkpointing and rollback-recovery is a very effective technique for tolerating failures. Usually, the checkpoint data is saved to disk files. However, in some situations the disk operations may result in considerable performance overhead. Alternative solutions use main memory to hold the checkpoint data. This paper presents two main-memory checkpointing schemes that can be used on any parallel machine without requiring any change to the hardware: one scheme saves the checkpoints in the memory of other processors, while the other is based on a parity approach. Both techniques have been implemented and evaluated on a commercial parallel machine. The conclusions drawn from this evaluation clearly show the superiority of one of the schemes.
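
The two schemes named in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation, which the abstract does not describe. It assumes that every node's state is a byte buffer of the same length, that only one node fails at a time, that the copy scheme stores each node's state on its successor in a ring, and that the parity scheme uses bytewise XOR held on a separate parity node. All function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def neighbor_checkpoint(states):
    """Copy scheme (assumed ring layout): node i holds a full copy
    of the state of node i-1 in its own memory."""
    n = len(states)
    return [states[(i - 1) % n] for i in range(n)]


def neighbor_recover(backups, failed):
    """The state of a failed node is read back from its successor."""
    n = len(backups)
    return backups[(failed + 1) % n]


def parity_checkpoint(states):
    """Parity scheme (assumed XOR): one buffer equal to the XOR of
    all node states, kept on a separate parity node."""
    return reduce(xor_bytes, states)


def parity_recover(states, parity, failed):
    """XOR the parity with every surviving state to rebuild the lost one."""
    survivors = [s for i, s in enumerate(states) if i != failed]
    return reduce(xor_bytes, survivors, parity)


if __name__ == "__main__":
    states = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]

    backups = neighbor_checkpoint(states)
    print(neighbor_recover(backups, failed=1) == states[1])

    parity = parity_checkpoint(states)
    print(parity_recover(states, parity, failed=1) == states[1])
```

Under these assumptions the copy scheme stores one extra full checkpoint per node, whereas the parity scheme stores a single checkpoint-sized buffer for the whole machine but must combine the state of every surviving node on recovery. The sketch does not say which scheme the paper found superior.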