On modeling and tolerating incorrect software

Authors:
Anish Arora; Marvin Theimer
Affiliations:
The Ohio State University, Columbus, OH 43214, USA, E-mail: anish@cis.ohio-state.edu; Microsoft Research, Redmond, WA 98052, USA, E-mail: theimer@microsoft.com
Venue:
Journal of High Speed Networks - Self-Stabilizing Systems, Part 2
Year:
2005


Abstract

Distributed systems have to deal with the following scenarios in practice: bugs in components; incorrect specifications of components and, therefore, incorrect use of components; unanticipated faults that arise from complex interactions or from a failure to contain the effects of faults in lower-level components; and evolution of components. Extant fault-tolerance models deal with such scenarios only in a limited manner. In particular, we point out that state corruption is inevitable in practice and that one must therefore accept it and seek to correct it. The well-known concepts of detectors and correctors can be used to find and repair state corruption. However, these concepts have traditionally been employed to detect and correct errors caused by misbehaving system components immediately. Immediate detection and correction are often too expensive to perform, and hence we consider the implications of running detectors and correctors only intermittently. More specifically, we address issues that must be dealt with when state corruption may persist within a system for a period of time. We show how to both detect and correct state corruption caused by infrequently occurring “transient” errors, even though such corruption can actively spread to other parts of the system. We also show how to eventually detect all state corruption, even in cases where continually recurring errors are constantly introducing new state corruption. Finally, we discuss the minimum set of capabilities needed from a trusted base of software in order to guarantee the correctness of our algorithms.