Fault Tolerance for Off-the-Shelf Applications and Hardware

Authors:
M. Russinovich; Z. Segall
Venue:
FTCS '95 Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing
Year:
1995


Abstract

The concept of middleware provides a transparent way to augment and change the characteristics of a service provider as seen by a client. Fault-tolerant policies are ideal candidates for middleware implementation. We have defined and implemented operating-system-based middleware support that provides the power and flexibility needed by diverse fault-tolerant policies. This mechanism, called the sentry, has been built into the UNIX 4.3 BSD operating system server running on a Mach 3.0 kernel. To demonstrate the effectiveness of the mechanism, several policies have been implemented using sentries, including checkpointing and journaling. The implementation shows that complex fault-tolerant policies can be implemented efficiently and transparently as middleware. The performance overhead of input journaling is less than 5%, and application suspension during a checkpoint typically lasts under 10 seconds. A standard hard disk stores the journal and checkpoint information, with dedicated storage requirements of less than 20 MB.
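
To make the idea of a sentry concrete, the following is a minimal user-level sketch in Python. It is not the authors' implementation, which lives inside the UNIX server on Mach. It assumes a sentry can be modeled as a wrapper that runs policy hooks on entry to and exit from a service call, and that input journaling means recording what the service returns so the same inputs can be replayed after a failure. The names `Sentry`, `InputJournal`, `make_reader`, and `client` are hypothetical and chosen only for illustration.

```python
from typing import Any, Callable, List


class Sentry:
    """Hypothetical interposition point around one service call.

    Assumption: entry hooks run before the service, exit hooks after it.
    """

    def __init__(self, service: Callable[..., Any]) -> None:
        self.service = service
        self.entry_hooks: List[Callable[[tuple], None]] = []
        self.exit_hooks: List[Callable[[tuple, Any], None]] = []

    def __call__(self, *args: Any) -> Any:
        for hook in self.entry_hooks:
            hook(args)
        result = self.service(*args)
        for hook in self.exit_hooks:
            hook(args, result)
        return result


class InputJournal:
    """Illustrative policy: record every result the service hands back."""

    def __init__(self) -> None:
        self.entries: List[Any] = []

    def record(self, args: tuple, result: Any) -> None:
        self.entries.append(result)

    def replay(self) -> Callable[..., Any]:
        # Return a stand-in service that feeds back the journaled inputs in order.
        pending = iter(list(self.entries))
        return lambda *args: next(pending)


def make_reader(data: List[str]) -> Callable[[], str]:
    """Stand-in for an input service such as a read call."""
    source = iter(data)
    return lambda: next(source)


def client(read: Callable[[], str], count: int) -> str:
    """Unmodified client code: it only sees a read function."""
    return "".join(read() for _ in range(count))


journal = InputJournal()
guarded_read = Sentry(make_reader(["a", "b", "c"]))
guarded_read.exit_hooks.append(journal.record)

original = client(guarded_read, 3)       # normal run, inputs are journaled
recovered = client(journal.replay(), 3)  # re-execution fed from the journal
```

The client function is the same in both runs; only the object behind `read` changes, which is the sense in which such a policy is transparent to the application. A real implementation would also have to write the journal to stable storage and coordinate replay with a checkpoint, which this sketch omits.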