An Efficient and Scalable Approach for Implementing Fault-Tolerant DSM Architectures

Authors:
Christine Morin;Anne-Marie Kermarrec;Michel Banâtre;Alain Gefflaut
Affiliations:
IRISA/INRIA, Rennes, France;Microsoft Corp., Cambridge, UK;IRISA/INRIA, Rennes, France;IBM T.J. Watson Research Center, Hawthorne
Venue:
IEEE Transactions on Computers
Year:
2000

Abstract

Distributed Shared Memory (DSM) architectures are attractive for executing high-performance parallel applications. Because they are made up of a large number of components, however, these architectures have a high probability of failure. We propose a protocol to tolerate node failures in cache-based DSM architectures. The proposed solution is based on backward error recovery and extends the existing coherence protocol to manage both the data used by processors for computation and the recovery data used for fault tolerance. This approach can be applied to both Cache Only Memory Architectures (COMA) and Shared Virtual Memory (SVM) systems. The implementation of the protocol in a COMA architecture has been evaluated by simulation. The protocol has also been implemented in an SVM system on a network of workstations. Both simulation results and measurements show that our solution is efficient and scalable.