An approach to the coordination of cooperating concurrent processes, each capable of error detection and recovery, is presented. Error detection, rollback, and retry in a process are specified by a well-structured language construct called the recovery block. The recovery points of interacting processes must be properly coordinated to prevent a disastrous avalanche of process rollbacks. The approach relies on an intelligent processor system (one that runs the processes) capable of establishing and discarding the recovery points of interacting processes in a well-coordinated manner, such that a process never makes two consecutive rollbacks without a retry in between, and every process rollback is a minimum-distance rollback. Following a discussion of the underlying philosophy of the author's approach, basic rules for reducing the storage and time overhead in such a processor system are discussed. Examples are drawn from systems in which processes communicate through monitors.
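To make the recovery block construct concrete, the following is a minimal single-process sketch in Python. It is an illustration only, not the paper's notation or its processor system: the names `run_recovery_block`, `RecoveryBlockFailure`, `faulty_sort`, `safe_sort`, and `is_sorted` are hypothetical, the process state is assumed to be a plain dictionary, and the coordination of recovery points across interacting processes is deliberately left out.

```python
import copy


class RecoveryBlockFailure(Exception):
    """Raised when every alternate fails the acceptance test (hypothetical name)."""


def run_recovery_block(state, acceptance_test, alternates):
    """Run alternates in order until one passes the acceptance test.

    state           -- dict holding the process state (an assumption of this sketch)
    acceptance_test -- callable(state) -> bool, the error-detection step
    alternates      -- list of callables(state), the primary try block first
    """
    # Establish the recovery point: a saved copy of the state on entry.
    recovery_point = copy.deepcopy(state)

    for attempt in alternates:
        try:
            attempt(state)
            if acceptance_test(state):
                # Success: the recovery point is no longer needed.
                return state
        except Exception:
            # A raised error is treated like a failed acceptance test.
            pass

        # Rollback: restore the state saved at the recovery point,
        # then retry with the next alternate.
        state.clear()
        state.update(copy.deepcopy(recovery_point))

    raise RecoveryBlockFailure("all alternates failed the acceptance test")


def faulty_sort(state):
    # Deliberately wrong primary: reverses instead of sorting.
    state["data"] = state["data"][::-1]


def safe_sort(state):
    # Simpler alternate used on retry.
    state["data"] = sorted(state["data"])


def is_sorted(state):
    data = state["data"]
    return all(a <= b for a, b in zip(data, data[1:]))


result = run_recovery_block({"data": [3, 1, 2]}, is_sorted, [faulty_sort, safe_sort])
print(result["data"])
```

Here the primary alternate fails the acceptance test, the state is rolled back to the recovery point, and the retry with the second alternate succeeds. In a system of interacting processes, a rollback of this kind could invalidate information already passed to other processes, which is why the recovery points need the coordination described above.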