The performance of consistent checkpointing in distributed shared memory systems

Authors:
G. Cabillic;G. Muller;I. Puaut
Venue:
SRDS '95 Proceedings of the 14th Symposium on Reliable Distributed Systems
Year:
1995

Citing 0
Cited 15

A Survey of Recoverable Distributed Shared Virtual Memory Systems

IEEE Transactions on Parallel and Distributed Systems
Checkpointing Distributed Shared Memory

The Journal of Supercomputing - Special issue: high performance distributed computing
Staggered Consistent Checkpointing

IEEE Transactions on Parallel and Distributed Systems
Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations

The Journal of Supercomputing
A Low Overhead Logging Scheme for Fast Recovery in Distributed Shared Memory Systems

The Journal of Supercomputing
Scalable fault-tolerant distributed shared memory

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
Lazy Garbage Collection of Recovery State for Fault-Tolerant Distributed Shared Memory

IEEE Transactions on Parallel and Distributed Systems
Lazy Garbage Collection of Recovery State for Fault-Tolerant Distributed Shared Memory

IEEE Transactions on Parallel and Distributed Systems
An Experimental Evaluation of Coordinated Checkpointing in a Parallel Machine

EDCC-3 Proceedings of the Third European Dependable Computing Conference on Dependable Computing
An efficient causal logging scheme for recoverable distributed shared memory systems

Parallel Computing
Supporting fault-tolerance in heterogeneous distributed applications

HCW '97 Proceedings of the 6th Heterogeneous Computing Workshop (HCW '97)
Portable transparent checkpointing for distributed shared memory

HPDC '96 Proceedings of the 5th IEEE International Symposium on High Performance Distributed Computing
Global memory management for a multi computer system

WSS'00 Proceedings of the 4th conference on USENIX Windows Systems Symposium - Volume 4
Efficient user-level thread migration and checkpointing on windows NT clusters

WINSYM'99 Proceedings of the 3rd conference on USENIX Windows NT Symposium - Volume 3
Engineering Distributed Shared Memory Middleware for Java

OTM '09 Proceedings of the Confederated International Conferences, CoopIS, DOA, IS, and ODBASE 2009 on On the Move to Meaningful Internet Systems: Part I

Abstract

This paper presents the design and implementation of a consistent checkpointing scheme for distributed shared memory (DSM) systems. Our approach integrates checkpoints into the synchronization barriers already present in applications, which avoids the need to introduce an additional synchronization mechanism. The main advantage of our checkpointing mechanism is that performance degradation arises only while a checkpoint is being taken; hence, the programmer can tune the trade-off between the cost of checkpointing and the cost of longer rollbacks by adjusting the time between two successive checkpoints. The paper compares several implementations of the proposed consistent checkpointing mechanism (incremental, non-blocking, and pre-flushing) on the Intel Paragon multicomputer for several parallel scientific applications. Performance measurements show that careful optimization of the checkpointing protocol can reduce the time overhead of checkpointing from 8% to 0.04% of the application duration for a 6-minute checkpointing interval.
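The idea of taking checkpoints inside barriers the application already uses can be illustrated with a minimal sketch. This is not the paper's implementation: it uses Python threads in place of DSM nodes, an in-memory list in place of stable storage, and counts the checkpoint interval in barrier crossings rather than elapsed time. The names `CheckpointingBarrier` and `run_demo` are hypothetical, and the sketch shows only a simple blocking variant, not the incremental, non-blocking, or pre-flushing ones.

```python
import copy
import threading


class CheckpointingBarrier:
    """Barrier that snapshots shared state every `interval` crossings.

    Assumption: all workers are quiescent at the barrier, so the
    snapshot taken there is a consistent global state.
    """

    def __init__(self, parties, shared_state, interval):
        self.shared_state = shared_state
        self.interval = interval
        self.crossings = 0
        self.checkpoints = []  # stands in for stable storage
        # The action runs in exactly one thread, after every party has
        # arrived and before any of them is released.
        self._barrier = threading.Barrier(parties, action=self._on_all_arrived)

    def _on_all_arrived(self):
        self.crossings += 1
        if self.crossings % self.interval == 0:
            self.checkpoints.append(copy.deepcopy(self.shared_state))

    def wait(self):
        self._barrier.wait()


def run_demo(workers=4, steps=6, interval=3):
    state = {"cells": [0] * workers}
    barrier = CheckpointingBarrier(workers, state, interval)

    def worker(rank):
        for _ in range(steps):
            state["cells"][rank] += 1  # one unit of work per step
            barrier.wait()             # existing application barrier

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return barrier.checkpoints
```

In this sketch, barrier crossings that do not trigger a checkpoint cost only a counter increment, which mirrors the abstract's point that overhead arises only while a checkpoint is being taken. A larger `interval` means fewer snapshots but more work to redo after a rollback.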