Distributed Shared Virtual Memory (DSVM) systems provide a shared-memory abstraction on distributed-memory architectures. Such systems ease parallel application programming because the shared-memory programming model is often more natural than the message-passing paradigm. However, the probability that a DSVM experiences a failure increases with the number of sites. Fault tolerance mechanisms must therefore be implemented so that processes can continue their execution in the event of a failure. This paper gives an overview of recoverable DSVMs (RDSVMs) that provide a checkpointing mechanism to restart parallel computations in the event of a site failure.