Dynamic fault tolerance in DCMA - a dynamically configurable multicomputer architecture

Authors:
H. Kuefner;H. Baehring
Affiliations:
-;-
Venue:
SRDS '96 Proceedings of the 15th Symposium on Reliable Distributed Systems
Year:
1996

Abstract

This paper introduces a new architecture for a fault-tolerant computer system that connects high-end PCs or workstations through a high-speed network. To achieve platform independence, coupling is based on the widely used PCI bus. In contrast to commercially available fault-tolerant systems, we strongly emphasize mechanisms for tolerating transient and intermittent faults. To keep hardware costs low, the system is built from off-the-shelf computers, and the extensions to them are kept as small as possible. To reduce operational costs, the system can be dynamically adapted to different fault-tolerance demands on a program-by-program basis. The operating system performs this adaptation transparently to the application software. We use a commercially available real-time operating system with a POSIX-compliant UNIX interface. The range of fault tolerance extends from a non-redundant system of stand-alone computers, through a master/checker configuration, to a triple modular redundancy (TMR) system. The high-performance network also allows the system to operate as a parallel multicomputer.
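
The three redundancy levels named in the abstract can be illustrated with a minimal sketch. This is not the DCMA implementation, which the abstract does not describe in detail; the names `Mode`, `FaultDetected`, and `run_redundant` are hypothetical, and each "node" is modeled as a plain Python function. The sketch assumes the usual textbook behavior: a master/checker pair can detect a disagreement but not mask it, while TMR masks a single faulty result by majority vote.

```python
from collections import Counter
from enum import Enum
from typing import Callable, Sequence


class Mode(Enum):
    """Hypothetical per-program redundancy levels; the value is the replica count."""
    SIMPLEX = 1         # non-redundant: one node, no checking
    MASTER_CHECKER = 2  # two nodes: results are compared
    TMR = 3             # three nodes: majority vote


class FaultDetected(Exception):
    """Raised when replicas disagree and the fault cannot be masked."""


def run_redundant(mode: Mode, replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run the computation on as many replicas as the mode requires."""
    if len(replicas) < mode.value:
        raise ValueError(f"{mode.name} needs {mode.value} replica(s)")
    results = [replica(x) for replica in replicas[:mode.value]]

    if mode is Mode.SIMPLEX:
        # No redundancy: a faulty result goes unnoticed.
        return results[0]

    if mode is Mode.MASTER_CHECKER:
        # Detection only: a mismatch shows that a fault occurred,
        # but not which of the two nodes is wrong.
        if results[0] != results[1]:
            raise FaultDetected("master and checker disagree")
        return results[0]

    # TMR: a single faulty replica is outvoted by the other two.
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise FaultDetected("no majority among the three replicas")
    return value
```

For example, with a correct replica `good = lambda x: x * x` and a faulty one `bad = lambda x: x * x + 1`, calling `run_redundant(Mode.TMR, [good, bad, good], 3)` returns the majority value, whereas `run_redundant(Mode.MASTER_CHECKER, [good, bad], 3)` raises `FaultDetected`. Choosing the mode per call mirrors, in a very simplified way, the idea of adapting fault tolerance on a program-by-program basis.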