Evaluation of checkpoint mechanisms for massively parallel machines

Authors:
Tzi-Cker Chiueh;Peitao Deng
Affiliations:
-;-
Venue:
FTCS '96 Proceedings of the The Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing (FTCS '96)
Year:
1996

Citing 2
Cited 13

On the Reconfigurable Operation of Arrays with Defects for Image Processing

Proceedings of the IEEE International Workshop on Defect and Fault Tolerance in VLSI Systems
Checkpointing and Its Applications

FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing

Diskless Checkpointing

IEEE Transactions on Parallel and Distributed Systems
CLIP: a checkpointing tool for message-passing parallel programs

SC '97 Proceedings of the 1997 ACM/IEEE conference on Supercomputing
Improving the performance of coordinated checkpointers on networks of workstations using RAID techniques

SRDS '96 Proceedings of the 15th Symposium on Reliable Distributed Systems
Fault tolerant high performance computing by a coding approach

Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Cyclic Storage for Fault-Tolerant Distributed Executions

IEEE Transactions on Parallel and Distributed Systems
Algorithm-based fault tolerance applied to high performance computing

Journal of Parallel and Distributed Computing
Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems

Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Design and performance evaluation of enhanced two-level recovery scheme

PDCN '08 Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks
A novel fault-tolerant parallel algorithm

APPT'07 Proceedings of the 7th international conference on Advanced parallel processing technologies
Hybrid checkpointing using emerging nonvolatile memories for future exascale systems

ACM Transactions on Architecture and Code Optimization (TACO)
High performance linpack benchmark: a fault tolerant implementation without checkpointing

Proceedings of the international conference on Supercomputing
Containment domains: a scalable, efficient, and flexible resilience scheme for exascale systems

SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Containment domains: A scalable, efficient and flexible resilience scheme for exascale systems

Scientific Programming - Selected Papers from Super Computing 2012

Quantified Score

Hi-index	0.00

Visualization

Abstract

Massively parallel machines typically contain thousands of processor units and therefore are more likely to suffer system breakdown because of component failures. This paper studies efficient diskless checkpointing mechanisms for SIMD massively parallel machines. Three checkpointing schemes: mirror checkpointing, parity checkpointing, and partial parity checkpointing are compared in terms of their checkpoint performance and storage overheads, based on empirical measurements. Mirror checkpointing and parity checkpointing schemes have been successfully implemented and tested on a DECmpp 12000 machine, without hardware or OS modifications. It has been shown that mirror checkpointing is an order of magnitude faster than parity checkpointing, but takes twice as much storage overhead. Partial parity checkpointing, although significantly reduces the storage overhead, could lead to unpredictable execution performance. This paper also examines the detailed storage/performance tradeoffs for partial parity checkpointing through manual instrumentation, and describes the implementation experience from these experiments.