Abstract: The concept of middleware provides a transparent way to augment and change the characteristics of a service provider as seen from a client. Fault-tolerant policies are ideal candidates for middleware implementation. We have defined and implemented operating-system-based middleware support that provides the power and flexibility needed by diverse fault-tolerant policies. This mechanism, called the sentry, has been built into the UNIX 4.3 BSD operating system server running on a Mach 3.0 kernel. To demonstrate the effectiveness of the mechanism, several policies have been implemented using sentries, including checkpointing and journaling. The implementation shows that complex fault-tolerant policies can be implemented efficiently and transparently as middleware. The performance overhead of input journaling is less than 5%, and application suspension during a checkpoint typically lasts under 10 seconds. A standard hard disk stores the journal and checkpoint information, with dedicated storage requirements of less than 20 MB.
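The interposition idea behind the sentry can be illustrated with a minimal user-level sketch. This is not the paper's implementation, which lives inside the operating system server. It assumes that a sentry amounts to entry and exit hooks wrapped around a service operation, and that an input-journaling policy records each input as it is delivered so it can be replayed later. All names here (`Sentry`, `InputJournal`, `read_input`, `run_demo`) are hypothetical.

```python
from typing import Any, Callable, List, Tuple


class Sentry:
    """Hypothetical interposition point wrapped around one service operation."""

    def __init__(self, operation: Callable[..., Any]) -> None:
        self._operation = operation
        self._entry_hooks: List[Callable[[tuple], None]] = []
        self._exit_hooks: List[Callable[[tuple, Any], None]] = []

    def attach(self, on_entry=None, on_exit=None) -> None:
        # A policy registers hooks; the client and the service stay unchanged.
        if on_entry is not None:
            self._entry_hooks.append(on_entry)
        if on_exit is not None:
            self._exit_hooks.append(on_exit)

    def __call__(self, *args: Any) -> Any:
        for hook in self._entry_hooks:
            hook(args)
        result = self._operation(*args)
        for hook in self._exit_hooks:
            hook(args, result)
        return result


class InputJournal:
    """Assumed journaling policy: log every input result in delivery order."""

    def __init__(self) -> None:
        self.entries: List[Tuple[tuple, Any]] = []

    def record(self, args: tuple, result: Any) -> None:
        self.entries.append((args, result))

    def replay(self) -> List[Any]:
        # During recovery, the logged inputs are handed back in the same order.
        return [result for _, result in self.entries]


def read_input(source: List[str]) -> str:
    """Stand-in for a real input operation such as a read system call."""
    return source.pop(0)


def run_demo() -> Tuple[List[str], List[str]]:
    journal = InputJournal()
    guarded_read = Sentry(read_input)
    guarded_read.attach(on_exit=journal.record)
    source = ["a", "b", "c"]
    seen = [guarded_read(source) for _ in range(3)]
    return seen, journal.replay()
```

The caller invokes `guarded_read` exactly as it would invoke `read_input`, which is the sense in which such a policy is transparent to the application. A checkpointing policy could attach to the same hooks in the same way.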