Toward Systematic Design of Fault-Tolerant Systems

Authors:
Algirdas Avizienis
Venue:
Computer
Year:
1997

Citing 2
Cited 25

Fault-tolerance design of the IBM Enterprise System/9000 Type 9021 processors

IBM Journal of Research and Development
Fault-tolerance in the advanced automation system

EW 4 Proceedings of the 4th workshop on ACM SIGOPS European workshop

Stabilizing Pre-Run-Time Schedules With the Help of GraceTime

Real-Time Systems
Embryonics: A Bio-Inspired Cellular Architecture with Fault-Tolerant Properties

Genetic Programming and Evolvable Machines
An architectural-based reflective approach to incorporating exception handling into dependable software

Advances in exception handling techniques
Self-Repairing Multicellular Hardware: A Reliability Analysis

ECAL '99 Proceedings of the 5th European Conference on Advances in Artificial Life
Novel Approaches in Dependable Computing

EDCC-4 Proceedings of the 4th European Dependable Computing Conference on Dependable Computing
An Immune System Paradigm for the Design of Fault Tolerant Systems

EDCC-4 Proceedings of the 4th European Dependable Computing Conference on Dependable Computing
Immunotronics: Hardware Fault Tolerance Inspired by the Immune System

ICES '00 Proceedings of the Third International Conference on Evolvable Systems: From Biology to Hardware
Understanding Inherent Qualities of Evolved Circuits: Evolutionary History as a Predictor of Fault Tolerance

ICES '00 Proceedings of the Third International Conference on Evolvable Systems: From Biology to Hardware
Untidy Evolution: Evolving Messy Gates for Fault Tolerance

ICES '01 Proceedings of the 4th International Conference on Evolvable Systems: From Biology to Hardware
An Architectural-Based Reflective Approach to Incorporating Exception Handling into Dependable Software

Advances in Exception Handling Techniques (the book grew out of an ECOOP 2000 workshop)
What Designers of Bus and Network Architectures Should Know about Hypercubes

IEEE Transactions on Computers
On-Board Maintenance for Long-Life Systems

ASSET '98 Proceedings of the 1998 IEEE Workshop on Application - Specific Software Engineering and Technology
Describing Evolving Dependable Systems using Co-operative Software Architectures

ICSM '01 Proceedings of the IEEE International Conference on Software Maintenance (ICSM'01)
Reflections on Industry Trends and Experimental Research in Dependability

IEEE Transactions on Dependable and Secure Computing
SEU tolerant device, circuit and processor design

Proceedings of the 42nd annual Design Automation Conference
Stigmergic approaches applied to flexible fault-tolerant digital VLSI architectures

Journal of Parallel and Distributed Computing - Special issue on parallel bioinspired algorithms
Evaluating recovery aware components for grid reliability

Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering
Autonomic fault mitigation in embedded systems

Engineering Applications of Artificial Intelligence
Reliability and path length analysis of irregular fault tolerant multistage interconnection network

ACM SIGARCH Computer Architecture News
Attaining fault tolerance through self-adaption: the strengths and weaknesses of evolvable hardware approaches

WCCI'08 Proceedings of the 2008 IEEE world conference on Computational intelligence: research frontiers
Achieving software robustness via large-scale multiagent systems

Software engineering for large-scale multi-agent systems
Object-oriented architecture for digital pulse shape acquisition from AZ/4π detectors: a case study

RTC'05 Proceedings of the 14th IEEE-NPSS conference on Real time
Formal development of reactive fault tolerant systems

RISE'05 Proceedings of the Second international conference on Rapid Integration of Software Engineering Techniques
Immunising automated teller machines

ICARIS'05 Proceedings of the 4th international conference on Artificial Immune Systems
The conflict between self-* capabilities and predictability

Self-star Properties in Complex Information Systems

Quantified Score

Hi-index: 4.10

Abstract

The mid-century "space race" was a major impetus for the development of fault-tolerant computing. Over the succeeding 25 years researchers expanded the concept of fault tolerance and refined the techniques for achieving it. Nevertheless, the bottom-up approach, entailing an infrastructure of autonomously fault-tolerant subsystems integrated with global fault-tolerance functions, is less common today than the top-down approach, which relies on off-the-shelf (OTS) subsystems and a global monitoring function. A design paradigm for the systematic treatment of fault tolerance involves four steps: specification, implementation, evaluation, and modification. The paradigm offers a way to minimize the probability of oversights, mistakes, and inconsistencies that may occur during the implementation of fault tolerance. In spite of the long-range merits of the bottom-up approach, time and cost constraints often lead developers to use OTS subsystems when designing systems that are expected to be highly dependable. Even the Pentium Pro, which appears to have the most complete set of fault-tolerance functions among contemporary microprocessors, has major drawbacks. Moreover, systems built from OTS subsystems are difficult to retrofit for fault tolerance. Without hardware support for fault tolerance, the only solution is to build a software monitor subsystem that tries to check all subsystems for indications of failure. But the monitor itself is unprotected because it resides and executes on an OTS processor. Researchers would do well to consider the human immune system as a model for systems in which fault tolerance is an integral attribute of every hardware element.