This paper presents the design and implementation of a consistent checkpointing scheme for distributed shared memory (DSM) systems. Our approach integrates checkpoints into the synchronization barriers that already exist in applications, which avoids introducing an additional synchronization mechanism. The main advantage of this mechanism is that performance degrades only while a checkpoint is being taken; the programmer can therefore tune the trade-off between the cost of checkpointing and the cost of longer rollbacks by adjusting the interval between two successive checkpoints. The paper compares several implementations of the proposed consistent checkpointing mechanism (incremental, non-blocking, and pre-flushing) on the Intel Paragon multicomputer for several parallel scientific applications. Performance measurements show that careful optimization of the checkpointing protocol can reduce the time overhead of checkpointing from 8% to 0.04% of the application duration for a 6-minute checkpointing interval.
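The idea of piggybacking checkpoints on existing barriers can be illustrated with a minimal single-process sketch. It is not the paper's Paragon implementation: threads stand in for DSM nodes, a dictionary stands in for shared pages and for stable storage, and the names `CheckpointingBarrier`, `every_n_barriers`, and `run_demo` are hypothetical. The checkpoint interval is expressed here as a number of barrier crossings rather than elapsed time, and only the incremental variant (saving pages written since the last checkpoint) is shown.

```python
import copy
import threading


class CheckpointingBarrier:
    """Barrier that takes a checkpoint at an existing synchronization point."""

    def __init__(self, parties, memory, every_n_barriers):
        self.memory = memory                  # shared "pages": page id -> value
        self.every_n = every_n_barriers       # assumed knob: checkpoint interval
        self.crossings = 0
        self.checkpoints_taken = 0
        self.dirty = set()                    # pages written since last checkpoint
        self.stable = copy.deepcopy(memory)   # full initial checkpoint
        self._lock = threading.Lock()
        # The barrier action runs in one thread before any waiter is released.
        self._barrier = threading.Barrier(parties, action=self._maybe_checkpoint)

    def write(self, page, value):
        with self._lock:
            self.memory[page] = value
            self.dirty.add(page)

    def _maybe_checkpoint(self):
        # All parties are blocked at the barrier here, so memory is quiescent
        # and the saved state is consistent without extra synchronization.
        self.crossings += 1
        if self.crossings % self.every_n == 0:
            for page in self.dirty:           # incremental: dirty pages only
                self.stable[page] = copy.deepcopy(self.memory[page])
            self.dirty.clear()
            self.checkpoints_taken += 1

    def wait(self):
        self._barrier.wait()

    def rollback(self):
        # Restore the last consistent checkpoint after a (simulated) failure.
        self.memory.clear()
        self.memory.update(copy.deepcopy(self.stable))
        self.dirty.clear()


def run_demo(parties=4, steps=4, every_n_barriers=2):
    cb = CheckpointingBarrier(parties, {i: 0 for i in range(parties)},
                              every_n_barriers)

    def worker(rank):
        for step in range(1, steps + 1):
            cb.write(rank, step)              # compute phase
            cb.wait()                         # barrier the application already has

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(parties)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cb
```

Raising `every_n_barriers` in this sketch lowers the checkpointing cost during failure-free execution but increases the amount of work lost on `rollback()`, which mirrors the trade-off described above.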