Distributed Shared Memory (DSM) architectures are attractive for executing high-performance parallel applications. Because they are made up of a large number of components, however, these architectures have a high probability of failure. We propose a protocol to tolerate node failures in cache-based DSM architectures. The proposed solution is based on backward error recovery and extends the existing coherence protocol to manage both the data used by processors for computation and the recovery data used for fault tolerance. This approach can be applied to both Cache Only Memory Architectures (COMA) and Shared Virtual Memory (SVM) systems. The implementation of the protocol in a COMA architecture has been evaluated by simulation. The protocol has also been implemented in an SVM system on a network of workstations. Both the simulation results and the measurements show that our solution is efficient and scalable.
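The general idea of backward error recovery with separate current and recovery data can be illustrated with a toy model. The sketch below is not the protocol proposed in the paper. It assumes a single current copy per item, a ring of nodes, and two recovery copies per item placed on distinct nodes, and every name in it (`Node`, `ToyRecoverableDSM`, `establish_recovery_point`, `rollback`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node memory holding two kinds of data."""
    name: str
    current: dict = field(default_factory=dict)   # data used by the computation
    recovery: dict = field(default_factory=dict)  # data kept only for rollback
    alive: bool = True


class ToyRecoverableDSM:
    """Toy model only: one current copy per item, two recovery copies."""

    def __init__(self, names):
        self.nodes = {name: Node(name) for name in names}
        self.ring = list(names)

    def _live(self):
        return [n for n in self.ring if self.nodes[n].alive]

    def write(self, name, item, value):
        # Assumed simplification: the writer becomes the only holder of the item.
        for node in self.nodes.values():
            node.current.pop(item, None)
        self.nodes[name].current[item] = value

    def read(self, item):
        for name in self._live():
            if item in self.nodes[name].current:
                return self.nodes[name].current[item]
        raise KeyError(item)

    def establish_recovery_point(self):
        # Copy every current item into recovery data on its holder and on the
        # next live node, so one node failure leaves at least one copy.
        live = self._live()
        fresh = {name: {} for name in live}
        for i, name in enumerate(live):
            backup = live[(i + 1) % len(live)]
            for item, value in self.nodes[name].current.items():
                fresh[name][item] = value
                fresh[backup][item] = value
        for name in live:
            self.nodes[name].recovery = fresh[name]

    def fail(self, name):
        # A failed node loses both its current and its recovery data.
        node = self.nodes[name]
        node.alive = False
        node.current.clear()
        node.recovery.clear()

    def rollback(self):
        # Discard all current data, then rebuild it from surviving recovery copies.
        live = self._live()
        for name in live:
            self.nodes[name].current = {}
        restored = set()
        for name in live:
            for item, value in self.nodes[name].recovery.items():
                if item not in restored:
                    self.nodes[name].current[item] = value
                    restored.add(item)
        # Re-replicate so that each item again has two recovery copies.
        self.establish_recovery_point()
```

In this toy model, a write made after the last recovery point is lost when the system rolls back, which is the defining property of backward error recovery. How the actual protocol integrates recovery data into the coherence protocol is not described in the abstract and is not modeled here.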