Distributed shared memory (DSM) is a very promising programming model for exploiting the parallelism of distributed memory systems, because it provides a higher level of abstraction than simple message passing. Although the nodes of standard distributed systems exhibit high crash rates, only very few DSM environments have some kind of support for fault-tolerance. In this article, we present a checkpointing mechanism for a DSM system that is efficient and portable. It offers some portability because it is built on top of MPI and uses only the services offered by MPI and a POSIX-compliant local file system. As far as we know, this is the first real implementation of such a scheme for DSM. Along with the description of the algorithm, we present experimental results obtained in a cluster of workstations. We hope that our research shows that efficient, transparent and portable checkpointing is viable for DSM systems.
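The abstract does not describe the algorithm itself, so the following is only a minimal sketch of one ingredient it names: saving a process's state to a POSIX-compliant local file system. It assumes a common write-to-temporary, fsync, then rename pattern, which may or may not be what the article uses. The MPI coordination between processes is deliberately left out, and the names `write_checkpoint`, `read_checkpoint`, `rank` and `epoch` are hypothetical, chosen for illustration.

```python
import os
import pickle
import tempfile


def write_checkpoint(state, directory, rank, epoch):
    """Write one process's checkpoint so that a crash never leaves a torn file.

    Assumption (not from the article): the data goes to a temporary file,
    is flushed with fsync, and is then renamed over the previous checkpoint.
    """
    final_path = os.path.join(directory, f"ckpt_rank{rank}.bin")
    tmp_path = final_path + ".tmp"
    payload = pickle.dumps({"epoch": epoch, "state": state})

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        offset = 0
        # os.write may write fewer bytes than requested, so loop until done.
        while offset < len(payload):
            offset += os.write(fd, payload[offset:])
        os.fsync(fd)  # force the data to stable storage before publishing it
    finally:
        os.close(fd)

    # On POSIX, renaming within one file system atomically replaces the target.
    os.replace(tmp_path, final_path)
    return final_path


def read_checkpoint(directory, rank):
    """Return the last complete checkpoint for this rank, or None if absent."""
    path = os.path.join(directory, f"ckpt_rank{rank}.bin")
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        write_checkpoint({"page0": b"abc"}, d, rank=0, epoch=1)
        write_checkpoint({"page0": b"xyz"}, d, rank=0, epoch=2)
        print(read_checkpoint(d, rank=0)["epoch"])
```

In a real system each process would call something like `write_checkpoint` only at a point agreed on by all processes, for example after a synchronization step over MPI; how that agreement is reached is the part the article's algorithm addresses and is not shown here.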