Hardware fault containment in scalable shared-memory multiprocessors

  • Authors:
  • Dan Teodosiu; Joel Baxter; Kinshuk Govil; John Chapin; Mendel Rosenblum; Mark Horowitz

  • Affiliations:
  • Computer Systems Laboratory, Stanford University, Stanford, CA (all authors); John Chapin also: Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA

  • Venue:
  • Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA '97)
  • Year:
  • 1997


Abstract

Current shared-memory multiprocessors are inherently vulnerable to faults: any significant hardware or system software fault causes the entire system to fail. Unless provisions are made to limit the impact of faults, users will perceive a decrease in reliability when they entrust their applications to larger machines. This paper shows that fault containment techniques can be effectively applied to scalable shared-memory multiprocessors to reduce the reliability problems created by increased machine size.

The primary goal of our approach is to leave normal-mode performance unaffected. Rather than using expensive fault-tolerance techniques to mask the effects of data and resource loss, our strategy is based on limiting the damage caused by faults to only a portion of the machine. After a hardware fault, we run a distributed recovery algorithm that allows normal operation to be resumed in the functioning parts of the machine.

Our approach is implemented in the Stanford FLASH multiprocessor. Using a detailed hardware simulator, we have performed a number of fault injection experiments on a FLASH system running Hive, an operating system designed to support fault containment. The results we report validate our approach and show that, in conjunction with an operating system like Hive, we can improve the reliability seen by unmodified applications without substantial performance cost. Simulation results suggest that our algorithms scale well for systems of up to 128 processors.
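To make the containment idea concrete, the sketch below shows a toy model of post-fault recovery: each surviving node independently discards references to memory owned by a failed node and then resumes. This is a minimal illustration only, not the FLASH hardware or the Hive recovery algorithm; the `Node` structure, `page_owner` table, and `recover_node` routine are all hypothetical names invented for this example.

```c
/* Toy model of fault containment: after a node fails, every surviving
 * node drops its references to memory backed by the failed node and
 * continues running. Purely illustrative; not the FLASH/Hive design. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_NODES 8
#define PAGES_PER_NODE 4

typedef struct {
    bool alive;
    /* page_owner[p] records which node owns the memory backing page p
     * on this node; -1 marks an unmapped slot. */
    int page_owner[PAGES_PER_NODE];
} Node;

static Node nodes[NUM_NODES];

/* Recovery step run independently on each surviving node: discard any
 * mapping whose backing memory lived on a failed node, so normal
 * operation can resume using only the machine's functioning parts. */
static void recover_node(Node *n) {
    for (int p = 0; p < PAGES_PER_NODE; p++) {
        int owner = n->page_owner[p];
        if (owner >= 0 && !nodes[owner].alive)
            n->page_owner[p] = -1; /* drop reference to lost memory */
    }
}

int main(void) {
    /* Initialize: every node alive, page ownership striped across nodes. */
    for (int i = 0; i < NUM_NODES; i++) {
        nodes[i].alive = true;
        for (int p = 0; p < PAGES_PER_NODE; p++)
            nodes[i].page_owner[p] = (i + p) % NUM_NODES;
    }

    nodes[3].alive = false; /* inject a hardware fault: node 3 fails */

    /* Distributed recovery: each functioning node cleans up locally. */
    for (int i = 0; i < NUM_NODES; i++)
        if (nodes[i].alive)
            recover_node(&nodes[i]);

    /* Report surviving nodes; -1 entries mark mappings lost to the fault. */
    for (int i = 0; i < NUM_NODES; i++) {
        if (!nodes[i].alive) { printf("node %d: FAILED\n", i); continue; }
        printf("node %d:", i);
        for (int p = 0; p < PAGES_PER_NODE; p++)
            printf(" %d", nodes[i].page_owner[p]);
        printf("\n");
    }
    return 0;
}
```

The key property the sketch illustrates is that recovery is local and parallel: no node masks the fault or reconstructs lost data, it simply confines the damage and keeps the rest of the machine usable.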