Performance Evaluation of Fault Tolerance for Parallel Applications in Networked Environments

Authors:
Pierre Sens
Bertil Folliot
Venue:
ICPP '97 Proceedings of the International Conference on Parallel Processing
Year:
1997

Abstract

This paper presents a performance evaluation of STAR, a software fault manager for distributed applications. STAR exploits the natural redundancy of networks of workstations to offer a high level of fault tolerance. Fault management is transparent to the supported parallel applications. STAR is application independent, highly configurable, and easily portable to UNIX-like operating systems. The current implementation is based on independent checkpointing and message logging. Measurements show both the efficiency and the limits of this implementation. The challenge is to show that a software approach to fault tolerance can be implemented efficiently in a standard networked environment.