This paper proposes a hierarchical error detection framework for a Software Implemented Fault Tolerance (SIFT) layer of a distributed system. A four-level error detection hierarchy is proposed in the context of Chameleon, a software environment that provides adaptive fault tolerance in a setting of commercial off-the-shelf (COTS) system components and software. The design and implementation of a software-based distributed signature monitoring scheme, which is central to the proposed four-level hierarchy, are described. Both intralevel and interlevel optimizations are proposed; they minimize the overhead of detection and can adapt to runtime requirements. The paper presents results from a prototype implementation of two levels of the error detection hierarchy, along with results from a detailed simulation of the overall environment. The results indicate a substantial increase in availability due to the detection framework and clarify the trade-offs between overhead and coverage for different combinations of techniques.