Large-scale distributed systems are very attractive for executing parallel applications that require large amounts of computing power. However, their high probability of site failure is unacceptable, especially for long-running applications. In this paper, we address this problem and propose a checkpointing mechanism that relies on a recoverable distributed shared memory (DSM) to tolerate a single node failure. Although most recoverable DSMs require specific hardware to store recovery data, our scheme uses standard memories to store both current and recovery data. Moreover, the management of recovery data is merged with the management of current data by extending the DSM's coherence protocol. This approach limits the need for hardware development and takes advantage of the data replication already provided by a DSM to limit the number of pages transferred during checkpointing. The paper also presents an implementation and a preliminary performance evaluation of our recoverable DSM on an Intel Paragon with 56 nodes.
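The idea of reusing existing replicas at checkpoint time can be illustrated with a small toy model. This is a sketch under stated assumptions, not the protocol from the paper: the names `Page`, `checkpoint`, and `survives_failure` are hypothetical, and the model assumes that tolerating one node failure means keeping two recovery copies of every page on distinct nodes. A page that is already replicated needs no transfer; only a page with a single copy is sent to another node.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    # Nodes that currently hold an up-to-date copy (hypothetical model).
    copies: set
    # Nodes whose copy has been marked as recovery data at the last checkpoint.
    recovery: set = field(default_factory=set)


def checkpoint(pages, num_nodes):
    """Mark two copies of each page, on distinct nodes, as recovery data.

    Returns the number of page transfers the checkpoint needed.
    Assumes num_nodes >= 2.
    """
    transfers = 0
    for page in pages.values():
        # Reuse replicas the coherence protocol already created.
        chosen = sorted(page.copies)[:2]
        if len(chosen) < 2:
            # Only an unreplicated page has to be sent to another node.
            target = next(n for n in range(num_nodes) if n not in chosen)
            chosen.append(target)
            page.copies.add(target)
            transfers += 1
        page.recovery = set(chosen)
    return transfers


def survives_failure(pages, failed_node):
    """True if every page keeps at least one recovery copy after the failure."""
    return all(page.recovery - {failed_node} for page in pages.values())


pages = {
    0: Page(copies={0, 1}),  # already replicated: no transfer needed
    1: Page(copies={2}),     # single copy: one transfer needed
}
transfers = checkpoint(pages, num_nodes=4)
```

In this example only page 1 is transferred, and after the checkpoint the loss of any one node leaves a recovery copy of both pages on a surviving node.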