The Sequoia computer is a tightly coupled multiprocessor that avoids most of the fault-tolerance disadvantages of tight coupling by using a fault-tolerant hardware-design approach. An overview is given of how the hardware architecture and operating system (OS) work together to provide a high degree of fault tolerance with good system performance. A description of the hardware is followed by a discussion of the multiprocessor synchronization problem. Kernel support for fault recovery and the recovery process itself are examined. It is shown that the kernel, through a combination of locking, shadowed memory, and controlled flushing of non-write-through cache, maintains a consistent main memory state that is recoverable from any single-point failure. User shared memory is also discussed.
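
The idea that shadowed memory plus a controlled cache flush can keep main memory recoverable may be easier to see in a toy model. The sketch below is an illustration only, not the Sequoia design: the class and function names, the per-copy validity flags, and the fault-injection parameter `fail_after` are all assumptions made for the example, and locking is left out. Dirty cache lines are flushed to one memory copy and then to the other, so a fault at any point leaves at least one copy holding a complete state, either entirely pre-flush or entirely post-flush.

```python
class SimulatedFault(Exception):
    """Raised to model a single-point failure in the middle of a flush."""


class ShadowedMemory:
    """Toy model (hypothetical): two copies of main memory, flushed in sequence."""

    def __init__(self, initial):
        self.copy_a = dict(initial)
        self.copy_b = dict(initial)
        # Assumed bookkeeping: a copy is marked invalid while it is being updated.
        self.a_valid = True
        self.b_valid = True

    def flush(self, dirty_lines, fail_after=None):
        """Write dirty cache lines to copy A, then to copy B.

        fail_after: number of successful line writes after which a fault
        is injected (None means no fault).
        """
        writes = 0

        def write(target, addr, value):
            nonlocal writes
            if fail_after is not None and writes >= fail_after:
                raise SimulatedFault(f"fault after {writes} writes")
            target[addr] = value
            writes += 1

        self.a_valid = False              # copy A may be partially updated
        for addr, value in dirty_lines.items():
            write(self.copy_a, addr, value)
        self.a_valid = True               # copy A now holds the new state

        self.b_valid = False              # copy B may be partially updated
        for addr, value in dirty_lines.items():
            write(self.copy_b, addr, value)
        self.b_valid = True               # both copies agree again

    def recover(self):
        """Rebuild both copies from whichever copy is marked valid."""
        if self.a_valid:
            self.copy_b = dict(self.copy_a)
            self.b_valid = True
        else:
            self.copy_a = dict(self.copy_b)
            self.a_valid = True
        return dict(self.copy_a)


def recovered_state(old, new, fail_after):
    """Flush `new` over `old`, injecting a fault, and return the recovered memory."""
    mem = ShadowedMemory(old)
    try:
        mem.flush(new, fail_after=fail_after)
    except SimulatedFault:
        pass
    return mem.recover()


if __name__ == "__main__":
    old = {0: "x0", 1: "y0"}
    new = {0: "x1", 1: "y1"}
    for k in range(5):
        print(k, recovered_state(old, new, k))
```

In this model the two copies are never invalid at the same time, because copy A is marked valid before copy B is marked invalid. That ordering is what ensures recovery never observes a half-written state.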