The LOCUS distributed system architecture
The LOCUS distributed system architecture
The Sprite Network Operating System
Computer
Efficient synchronization primitives for large-scale cache-coherent multiprocessors
ASPLOS III Proceedings of the third international conference on Architectural support for programming languages and operating systems
The SPARC architecture manual (version 9)
The SPARC architecture manual (version 9)
The Stanford FLASH multiprocessor
ISCA '94 Proceedings of the 21st annual international symposium on Computer architecture
The performance impact of flexibility in the Stanford FLASH multiprocessor
ASPLOS VI Proceedings of the sixth international conference on Architectural support for programming languages and operating systems
Hive: fault containment for shared-memory multiprocessors
SOSP '95 Proceedings of the fifteenth ACM symposium on Operating systems principles
COMA: an opportunity for building fault-tolerant scalable shared memory multiprocessors
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
STiNG: a CC-NUMA computer system for the commercial marketplace
ISCA '96 Proceedings of the 23rd annual international symposium on Computer architecture
The Mercury Interconnect Architecture: a cost-effective infrastructure for high-performance servers
Proceedings of the 24th annual international symposium on Computer architecture
The SGI Origin: a ccNUMA highly scalable server
Proceedings of the 24th annual international symposium on Computer architecture
Fast estimation of diameter and shortest paths (without matrix multiplication)
Proceedings of the seventh annual ACM-SIAM symposium on Discrete algorithms
Distributed Algorithms
Complete Computer System Simulation: The SimOS Approach
IEEE Parallel & Distributed Technology: Systems & Technology
The MIPS R10000 Superscalar Microprocessor
IEEE Micro
Fault-Tolerant Wormhole Routing in Meshes without Virtual Channels
IEEE Transactions on Parallel and Distributed Systems
Solaris MC: a multi computer OS
ATEC '96 Proceedings of the 1996 annual conference on USENIX Annual Technical Conference
Cellular Disco: resource management using virtual clusters on shared-memory multiprocessors
Proceedings of the seventeenth ACM symposium on Operating systems principles
A Testbed for Evaluation of Fault-Tolerant Routing in Multiprocessor Interconnection Networks
IEEE Transactions on Parallel and Distributed Systems
Cellular disco: resource management using virtual clusters on shared-memory multiprocessors
ACM Transactions on Computer Systems (TOCS)
Architecture and design of AlphaServer GS320
ACM SIGPLAN Notices
Architecture and design of AlphaServer GS320
ASPLOS IX Proceedings of the ninth international conference on Architectural support for programming languages and operating systems
ReVive: cost-effective architectural support for rollback recovery in shared-memory multiprocessors
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
Increasing relevance of memory hardware errors: a case for recoverable programming models
EW 9 Proceedings of the 9th workshop on ACM SIGOPS European workshop: beyond the PC: new challenges for the operating system
The Double Scheme: Deadlock-free Dynamic Reconfiguration of Cut-Through Networks
ICPP '00 Proceedings of the Proceedings of the 2000 International Conference on Parallel Processing
Deadlock-Free Dynamic Reconfiguration Schemes for Increased Network Dependability
IEEE Transactions on Parallel and Distributed Systems
Software environment for integrating critical real-time control systems
Journal of Systems Architecture: the EUROMICRO Journal
Part II: A Methodology for Developing Deadlock-Free Dynamic Network Reconfiguration Processes
IEEE Transactions on Parallel and Distributed Systems
Global memory management for a multi computer system
WSS'00 Proceedings of the 4th conference on USENIX Windows Systems Symposium - Volume 4
RecTOR: A New and Efficient Method for Dynamic Network Reconfiguration
Euro-Par '09 Proceedings of the 15th International Euro-Par Conference on Parallel Processing
Simple deadlock-free dynamic network reconfiguration
HiPC'04 Proceedings of the 11th international conference on High Performance Computing
Hi-index | 0.00 |
Current shared-memory multiprocessors are inherently vulnerable to faults: any significant hardware or system software fault causes the entire system to fail. Unless provisions are made to limit the impact of faults, users will perceive a decrease in reliability when they entrust their applications to larger machines. This paper shows that fault containment techniques can be effectively applied to scalable shared-memory multiprocessors to reduce the reliability problems created by increased machine size.The primary goal of our approach is to leave normal-mode performance unaffected. Rather than using expensive fault-tolerance techniques to mask the effects of data and resource loss, our strategy is based on limiting the damage caused by faults to only a portion of the machine. After a hardware fault, we run a distributed recovery algorithm that allows normal operation to be resumed in the functioning parts of the machine.Our approach is implemented in the Stanford FLASH multiprocessor. Using a detailed hardware simulator, we have performed a number of fault injection experiments on a FLASH system running Hive, an operating system designed to support fault containment. The results we report validate our approach and show that in conjunction with an operating system like Hive, we can improve the reliability seen by unmodified applications without substantial performance cost. Simulation results suggest that our algorithms scale well for systems up to 128 processors.