In practice, distributed systems must cope with bugs in components; incorrect component specifications and the resulting incorrect use of components; unanticipated faults that arise from complex interactions or from failing to contain the effects of faults in lower-level components; and the evolution of components. Existing fault-tolerance models address these scenarios only in a limited manner. In particular, we argue that state corruption is inevitable in practice, so one must accept it and seek to correct it. The well-known concepts of detectors and correctors can be used to find and repair state corruption. Traditionally, however, these concepts have been employed to detect and correct errors caused by misbehaving system components immediately. Immediate detection and correction are often too expensive, so we consider the implications of running detectors and correctors only intermittently. More specifically, we address the issues that arise when state corruption may persist within a system for a period of time. We show how to both detect and correct state corruption caused by infrequently occurring “transient” errors, even though such corruption can actively spread to other parts of the system. We also show how to eventually detect all state corruption, even when continually recurring errors keep introducing new corruption. Finally, we discuss the minimum set of capabilities required of a trusted software base to guarantee the correctness of our algorithms.