COMA: an opportunity for building fault-tolerant scalable shared memory multiprocessors

Authors:
Christine Morin;Alain Gefflaut;Michel Banâtre;Anne-Marie Kermarrec
Affiliations:
-;-;-;-
Venue:
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
Year:
1996

Citing 15
Cited 9

The structure of System/88, a fault-tolerant computer

IBM Systems Journal
Sequoia: A Fault-Tolerant Tightly Coupled Multiprocessor for Transaction Processing

Computer
Directory-Based Cache Coherence in Large-Scale Multiprocessors

Computer
Abstract execution: a technique for efficiently tracing programs

Software—Practice & Experience
The Stanford Dash Multiprocessor

Computer
Comparative performance evaluation of cache-coherent NUMA and COMA architectures

ISCA '92 Proceedings of the 19th annual international symposium on Computer architecture
DDM: A Cache-Only Memory Architecture

Computer
The Stanford FLASH multiprocessor

ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
Replication and fault-tolerance in the ISIS system

Proceedings of the tenth ACM symposium on Operating systems principles
Tolerating node failures in cache only memory architectures

Proceedings of the 1994 ACM/IEEE conference on Supercomputing
Atomic Transactions

Distributed Systems - Architecture and Implementation, An Advanced Course
Fault Tolerance in Distributed Shared Memory Multiprocessors

Parallel Computer Architectures: Theory, Hardware, Software, Applications
An argument for simple COMA

HPCA '95 Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture
THE MIT ALEWIFE MACHINE: A LARGE-SCALE DISTRIBUTED-MEMORY MULTIPROCESSOR

THE MIT ALEWIFE MACHINE: A LARGE-SCALE DISTRIBUTED-MEMORY MULTIPROCESSOR
SPLASH: Stanford parallel applications for shared-memory

SPLASH: Stanford parallel applications for shared-memory

An Architecture for Tolerating Processor Failures in Shared-Memory Multiprocessors

IEEE Transactions on Computers
Hardware fault containment in scalable shared-memory multiprocessors

Proceedings of the 24th annual international symposium on Computer architecture
A Survey of Recoverable Distributed Shared Virtual Memory Systems

IEEE Transactions on Parallel and Distributed Systems
An Efficient and Scalable Approach for Implementing Fault-Tolerant DSM Architectures

IEEE Transactions on Computers
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors

ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Load Balancing for Parallel Query Execution on NUMA Multiprocessors

Distributed and Parallel Databases
Modeling and evaluating the time overhead induced by BER in COMA multiprocessors

Journal of Systems Architecture: the EUROMICRO Journal
SWICH: A Prototype for Efficient Cache-Level Checkpointing and Rollback

IEEE Micro
Rebound: scalable checkpointing for coherent shared memory

Proceedings of the 38th annual international symposium on Computer architecture

Quantified Score

Hi-index	0.01

Visualization

Abstract

Due to the increasing number of their components, Scalable Shared Memory Multiprocessors (SSMMs) have a very high probability of experiencing failures. Tolerating node failures therefore becomes very important for these architectures particularly if they must be used for long-running computations. In this paper, we show that the class of Cache Only Memory Architectures (COMA) are good candidates for building fault-tolerant SSMMs. A backward error recovery strategy can be implemented without significant hardware modification to previously proposed COMA by exploiting their standard replication mechanisms and extending the coherence protocol to transparently manage recovery data. Evaluation of the proposed fault-tolerant COMA is based on execution driven simulations using some of the Splash applications. We show that, for the simulated architecture, the performance degradation caused by fault-tolerance mechanisms varies from 5% in the best case to 35% in the worst case. The standard memory behavior is only slightly perturbed. Moreover, results also show that the proposed scheme preserves the architecture scalability and that the memory overhead remains low for parallel applications using mostly shared data.