An Architecture for Tolerating Processor Failures in Shared-Memory Multiprocessors

Authors:
Michel Banâtre; Alain Gefflaut; Philippe Joubert; Christine Morin; Peter A. Lee
Venue:
IEEE Transactions on Computers
Year:
1996

Abstract

This paper addresses fault tolerance in shared-memory multiprocessors and describes an architecture designed to tolerate processor failures transparently. The novel component of this architecture is the Recoverable Shared Memory (RSM), which provides a hardware-supported backward error recovery mechanism that minimizes the propagation of recovery when a processor fails. The RSM permits a shared-memory multiprocessor to be built from standard caches and cache coherence protocols, and requires no changes to application software. The performance of the recovery scheme supported by the RSM is evaluated and compared with that of other schemes proposed for fault-tolerant shared-memory multiprocessors. The performance study was conducted by simulation, using address traces collected from real parallel applications.