Bertil Folliot

This paper presents the performance evaluation of a software fault manager for distributed applications. Called STAR, the manager exploits the redundancy naturally present in networks of workstations to offer a high level of fault tolerance. Fault management is transparent to the supported parallel applications. STAR is application independent, highly configurable, and easily portable to UNIX-like operating systems. The current implementation is based on independent checkpointing and message logging. Measurements show the efficiency and the limits of this implementation. The challenge is to show that a software approach to fault tolerance can be implemented efficiently in a standard networked environment.