A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability

Authors:
Anne-Marie Kermarrec;Gilbert Cabillic;Alain Gefflaut;Christine Morin;Isabelle Puaut
Venue:
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Year:
1995

Citing 10
Cited 18

Cache coherence protocols: evaluation using a multiprocessor simulation model

ACM Transactions on Computer Systems (TOCS)
Sequoia: A Fault-Tolerant Tightly Coupled Multiprocessor for Transaction Processing

Computer
Memory coherence in shared virtual memory systems

ACM Transactions on Computer Systems (TOCS)
Recoverable Distributed Shared Virtual Memory

IEEE Transactions on Computers
Lightweight recoverable virtual memory

SOSP '93 Proceedings of the fourteenth ACM symposium on Operating systems principles
Fault Tolerance: Principles and Practice

Fault Tolerance: Principles and Practice
Tolerating node failures in cache only memory architectures

Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Kernel Support for Recoverable-Persistent Virtual Memory

USENIX MACH III Symposium
Notes on Data Base Operating Systems

Operating Systems, An Advanced Course
Integrating coherency and recoverability in distributed systems

OSDI '94 Proceedings of the 1st USENIX conference on Operating Systems Design and Implementation

A comprehensive bibliography of distributed shared memory

ACM SIGOPS Operating Systems Review
A Survey of Recoverable Distributed Shared Virtual Memory Systems

IEEE Transactions on Parallel and Distributed Systems
Checkpointing Distributed Shared Memory

The Journal of Supercomputing - Special issue: high performance distributed computing
An Efficient and Scalable Approach for Implementing Fault-Tolerant DSM Architectures

IEEE Transactions on Computers
A Low Overhead Logging Scheme for Fast Recovery in Distributed Shared Memory Systems

The Journal of Supercomputing
Scalable fault-tolerant distributed shared memory

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Increasing relevance of memory hardware errors: a case for recoverable programming models

EW 9 Proceedings of the 9th workshop on ACM SIGOPS European workshop: beyond the PC: new challenges for the operating system
Lazy Garbage Collection of Recovery State for Fault-Tolerant Distributed Shared Memory

IEEE Transactions on Parallel and Distributed Systems
Lazy Garbage Collection of Recovery State for Fault-Tolerant Distributed Shared Memory

IEEE Transactions on Parallel and Distributed Systems
Dynamic Data Replication: An Approach to Providing Fault-Tolerant Shared Memory Clusters

HPCA '03 Proceedings of the 9th International Symposium on High-Performance Computer Architecture
Portable transparent checkpointing for distributed shared memory

HPDC '96 Proceedings of the 5th IEEE International Symposium on High Performance Distributed Computing
Modeling and evaluating the time overhead induced by BER in COMA multiprocessors

Journal of Systems Architecture: the EUROMICRO Journal
Fast and transparent recovery for continuous availability of cluster-based servers

Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
Global memory management for a multi computer system

WSS'00 Proceedings of the 4th conference on USENIX Windows Systems Symposium - Volume 4
JVM susceptibility to memory errors

JVM'01 Proceedings of the 2001 Symposium on Java™ Virtual Machine Research and Technology Symposium - Volume 1
Spark: cluster computing with working sets

HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
Rebound: scalable checkpointing for coherent shared memory

Proceedings of the 38th annual international symposium on Computer architecture

Abstract

Large-scale distributed systems are attractive for executing parallel applications that require large amounts of computing power. However, their high probability of site failure is unacceptable, especially for long-running applications. In this paper, we address this problem and propose a checkpointing mechanism that relies on a recoverable distributed shared memory (DSM) to tolerate a single node failure. Although most recoverable DSMs require specific hardware to store recovery data, our scheme uses standard memories to store both current and recovery data. Moreover, the management of recovery data is merged with the management of current data by extending the DSM's coherence protocol. This approach limits hardware development and takes advantage of the data replication provided by a DSM to limit the number of pages transferred during checkpointing. The paper also presents an implementation and a preliminary performance evaluation of our recoverable DSM on an Intel Paragon with 56 nodes.
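
The abstract's point that existing DSM replication can reduce checkpoint traffic can be illustrated with a small sketch. It is not the paper's protocol: the `Page` type, the `plan_checkpoint` function, and the choice of two copies per page on distinct nodes are assumptions made here for illustration, on the reasoning that a page survives a single node failure if a second copy exists on another node. Pages that the coherence protocol has already replicated then need no transfer at checkpoint time.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Page:
    """Hypothetical page record: which nodes currently hold an up-to-date copy."""
    page_id: int
    holders: Set[int] = field(default_factory=set)


def plan_checkpoint(pages: List[Page], nodes: List[int],
                    copies_needed: int = 2) -> List[Tuple[int, int, int]]:
    """Return the (page_id, source_node, destination_node) transfers needed so
    that every page has `copies_needed` copies on distinct nodes.

    copies_needed=2 is an assumption of this sketch, not a value taken from
    the abstract.
    """
    transfers = []
    for page in pages:
        if not page.holders:
            raise ValueError(f"page {page.page_id} has no up-to-date copy")
        source = min(page.holders)
        missing = max(copies_needed - len(page.holders), 0)
        # Only nodes that do not already hold the page can receive a new copy.
        candidates = [n for n in nodes if n not in page.holders]
        if missing > len(candidates):
            raise ValueError(f"not enough nodes to replicate page {page.page_id}")
        for destination in candidates[:missing]:
            transfers.append((page.page_id, source, destination))
    return transfers


pages = [
    Page(0, {0, 1}),     # already replicated by read sharing: no transfer
    Page(1, {2}),        # single copy: one transfer needed
    Page(2, {0, 1, 3}),  # more copies than needed: no transfer
]
plan = plan_checkpoint(pages, nodes=[0, 1, 2, 3])
print(plan)
```

In this example only the page held by a single node is transferred; the two pages that are already shared between nodes cost nothing at checkpoint time, which is the effect the abstract attributes to reusing DSM replication.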