Software is a major source of reliability degradation in dependable systems. One of the classical remedies is to provide software fault tolerance by using N-Version Programming (NVP). However, because NVP solutions require non-standard hardware as well as changes and additions at all levels of the system, they are costly and have been used only in special cases. In previous work, a low-cost architecture for NVP execution was developed. The key features of this architecture are the use of off-the-shelf components, including communication standards, and the relocation of the fault tolerance functionality (voting, error detection, fault masking, consistency management, and recovery) into separate redundancy management circuitry, one for each redundant computing node. In this article we present an improved design of that architecture, specifically resolving some potential inconsistencies that were not treated in detail in the original design. In particular, we present novel techniques for enforcing replica determinism. Our improved architecture is based on the Controller Area Network (CAN). This choice goes beyond the obvious interest in using standards to reduce cost: the rest of the architecture is designed to take full advantage of CAN standard features, such as data consistency, in order to significantly reduce the complexity and cost of the resultant system and to improve its efficiency. Although initially developed for NVP, our redundancy management circuitry also supports other software replication techniques, such as active replication.
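
To make the voting and fault-masking idea concrete, the following is a minimal sketch of a strict-majority voter over the outputs of N software versions. It is not the redundancy management circuitry described above, which is implemented in hardware and communicates over CAN. The function names `majority_vote` and `disagreeing_versions`, and the convention that `None` marks a version that delivered no result, are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def majority_vote(outputs: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the value agreed on by a strict majority of all N versions.

    A None entry (hypothetical convention) stands for a version that
    delivered no result. Returns None when no strict majority exists.
    """
    n = len(outputs)
    # Count only the results that were actually delivered.
    counts = Counter(v for v in outputs if v is not None)
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    # The majority is taken over all N versions, not just the responders.
    return value if votes > n // 2 else None


def disagreeing_versions(
    outputs: Sequence[Optional[Hashable]], voted: Optional[Hashable]
) -> List[int]:
    """Indices of versions whose output differs from the voted value."""
    if voted is None:
        return []
    return [i for i, v in enumerate(outputs) if v != voted]


if __name__ == "__main__":
    results = [42, 42, 41]  # three versions, one faulty
    voted = majority_vote(results)
    print(voted, disagreeing_versions(results, voted))
```

With three versions, a single faulty or silent version is masked because the other two still form a strict majority. Exact comparison of outputs only works if correct replicas produce identical results, which is why replica determinism matters for this kind of voting.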