We present a checkpointing mechanism for a distributed shared memory (DSM) system that is transparent to the programmer yet efficient and portable. It is efficient because it is non-blocking and coordinated, and therefore free of the domino effect. It is reasonably portable because it is built on top of MPI and uses only the services provided by MPI and a POSIX-compliant local file system. As far as we know, this is the first real implementation of such a scheme for DSM. Along with a description of the algorithms used, we present experimental results obtained on a cluster of workstations and discuss insights gained from the implementation effort. We hope this research shows that efficient, transparent, and portable checkpointing is viable for DSM systems.
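The abstract does not spell out the protocol, so the following is only an illustrative sketch of what "non-blocking, coordinated" checkpointing can look like. It uses the classic Chandy-Lamport marker algorithm over simulated FIFO channels, which is a stand-in and not necessarily the scheme used in the paper. All names (`Process`, `Network`, `transfer`, `run_demo`, `recorded_total`) are hypothetical, and the FIFO channel ordering is an assumption of the sketch.

```python
from collections import deque


class Network:
    """Simulated FIFO channels, one queue per (src, dst) pair (assumption)."""

    def __init__(self):
        self.channels = {}

    def send(self, src, dst, msg):
        self.channels.setdefault((src, dst), deque()).append(msg)

    def deliver_one(self, procs):
        # Deliver the head of the first non-empty channel, if any.
        for (src, dst), queue in list(self.channels.items()):
            if queue:
                procs[dst].receive(src, queue.popleft(), self)
                return True
        return False


class Process:
    def __init__(self, pid, balance, peers):
        self.pid = pid
        self.balance = balance
        self.peers = peers
        self.saved_balance = None  # local checkpoint, None until taken
        self.recording = {}        # incoming channels still being logged
        self.channel_log = {}      # in-flight messages captured per channel

    def take_checkpoint(self, net):
        # Save local state and keep running: nobody is blocked.
        self.saved_balance = self.balance
        self.recording = {p: [] for p in self.peers}
        for p in self.peers:
            net.send(self.pid, p, ("MARKER", None))

    def receive(self, src, msg, net):
        kind, amount = msg
        if kind == "MARKER":
            if self.saved_balance is None:
                self.take_checkpoint(net)
            # Messages logged on this channel so far were in flight.
            self.channel_log[src] = self.recording.pop(src)
        else:
            self.balance += amount
            if src in self.recording:
                self.recording[src].append(amount)


def transfer(procs, net, src, dst, amount):
    procs[src].balance -= amount
    net.send(src, dst, ("MONEY", amount))


def recorded_total(procs):
    saved = sum(p.saved_balance for p in procs)
    in_flight = sum(sum(log) for p in procs for log in p.channel_log.values())
    return saved + in_flight


def run_demo():
    net = Network()
    procs = [Process(i, 100, [j for j in range(3) if j != i]) for i in range(3)]
    transfer(procs, net, 0, 1, 10)
    transfer(procs, net, 1, 2, 20)
    procs[0].take_checkpoint(net)   # process 0 initiates the checkpoint
    transfer(procs, net, 2, 0, 5)   # sent while the checkpoint is in progress
    while net.deliver_one(procs):
        pass
    return procs


if __name__ == "__main__":
    print(recorded_total(run_demo()))
```

In this run the three saved balances plus the one message captured in flight add up to the initial total of 300, so the checkpoints form a consistent recovery line. That consistency is what lets a coordinated scheme avoid the domino effect, and no process ever stops computing while the checkpoint is taken.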