System structure for software fault tolerance

Authors:
Brian Randell
Affiliations:
Computing Laboratory, University of Newcastle upon Tyne, Newcastle upon Tyne, England
Venue:
IEEE Transactions on Software Engineering
Year:
1975

Citing: 0
Cited by: 23

Design and principles of a fault tolerant system
ICSE '78 Proceedings of the 3rd international conference on Software engineering

Role-based authorization in decentralized health care environments
Proceedings of the 2003 ACM symposium on Applied computing

An adaptive scheme for fault-tolerant scheduling of soft real-time tasks in multiprocessor systems
Journal of Parallel and Distributed Computing

Quasi-atomic recovery for distributed agents
Parallel Computing

Applying aspects to a real-time embedded operating system
Proceedings of the 6th workshop on Aspects, components, and patterns for infrastructure software

Frameworks for designing and implementing dependable systems using Coordinated Atomic Actions: A comparative study
Journal of Systems and Software

A mobile agent platform for distributed network and systems management
Journal of Systems and Software

N-version programming with imperfect debugging
Computers and Electrical Engineering

A mechanism for exception handling and its verification rules
Computer Languages

Conversations of objects
Computer Languages

Fail-safety techniques and their extensions to concurrent systems
Computer Languages

Fair distribution of concerns in design and evaluation of fault-tolerant distributed computer systems
Computer Communications

Design of loosely coupled processes capable of time-bounded cooperative recovery: the PTC/SL scheme
Computer Communications

Supporting fault-tolerant and open distributed processing using RPC
Computer Communications

Optimal checkpointing interval of a communication system with rollback recovery
Mathematical and Computer Modelling: An International Journal

Software fault tree analysis
Journal of Systems and Software

Availability analysis for the design of distributed processing networks
Journal of Systems and Software

On the use of embedded debug features for permanent and transient fault resilience in microprocessors
Microprocessors & Microsystems

A multi-cycle checkpointing protocol that ensures strict 1-rollback
Information Processing Letters

ChameleonSoft: Software Behavior Encryption for Moving Target Defense
Mobile Networks and Applications

Recovery within long-running transactions
ACM Computing Surveys (CSUR)

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
The Journal of Supercomputing

Supporting undoability in systems operations
LISA'13 Proceedings of the 27th international conference on Large Installation System Administration

Abstract

This paper presents and discusses the rationale behind a method for structuring complex computing systems by the use of what we term "recovery blocks," "conversations," and "fault-tolerant interfaces." The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.
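To make the recovery-block idea more concrete, here is a minimal Python sketch of the general "ensure acceptance test by primary else by alternate ... else error" structure. It is illustrative only and does not reproduce the paper's own mechanisms. A full deep copy of the state stands in for whatever state-restoration facility a real implementation would provide. The names `recovery_block`, `RecoveryBlockFailure`, `is_sorted`, `faulty_sort`, and `simple_sort` are hypothetical and are not taken from the paper.

```python
import copy
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")


class RecoveryBlockFailure(Exception):
    """Hypothetical error raised when every alternate fails (the 'else error' case)."""


def recovery_block(state: S,
                   acceptance_test: Callable[[S], bool],
                   alternates: Sequence[Callable[[S], None]]) -> S:
    """ensure acceptance_test by alternates[0] else by alternates[1] ... else error.

    Assumption: a deep copy of `state` serves as the recovery point, and each
    alternate runs on a fresh copy of it, so a rejected attempt leaves no trace
    and the caller's original state is never modified.
    """
    checkpoint = copy.deepcopy(state)  # establish the recovery point
    for alternate in alternates:
        candidate = copy.deepcopy(checkpoint)  # restore prior state before each try
        try:
            alternate(candidate)
            if acceptance_test(candidate):
                return candidate  # accepted: the recovery point can be discarded
        except Exception:
            pass  # an exception counts as a detected error; fall through to next alternate
    raise RecoveryBlockFailure("all alternates were rejected by the acceptance test")


def is_sorted(xs: List[int]) -> bool:
    # Acceptance test: deliberately simpler than the computation it checks.
    return all(a <= b for a, b in zip(xs, xs[1:]))


def faulty_sort(xs: List[int]) -> None:
    # Hypothetical primary alternate containing a residual design fault.
    xs.reverse()


def simple_sort(xs: List[int]) -> None:
    # Hypothetical secondary alternate, assumed to be designed independently.
    xs.sort()


result = recovery_block([3, 1, 2], is_sorted, [faulty_sort, simple_sort])
print(result)  # [1, 2, 3]
```

In this sketch the primary reverses `[3, 1, 2]` into `[2, 1, 3]`. The acceptance test rejects that result, so the state is rolled back and the alternate produces `[1, 2, 3]`. The acceptance test here checks only ordering, not that the output is a permutation of the input. That choice reflects the general trade-off between a cheap test and a complete one.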