Optimistic recovery in distributed systems
ACM Transactions on Computer Systems (TOCS)
Checkpointing and Rollback-Recovery for Distributed Systems
IEEE Transactions on Software Engineering - Special issue on distributed systems
Atomic snapshots of shared memory
Journal of the ACM (JACM)
The Unified Modeling Language user guide
The Unified Modeling Language user guide
Chameleon: A Software Infrastructure for Adaptive Fault Tolerance
IEEE Transactions on Parallel and Distributed Systems
Hierarchical Error Detection in a Software Implemented Fault Tolerance (SIFT) Environment
IEEE Transactions on Knowledge and Data Engineering
Error Recovery in Shared Memory Multiprocessors Using Private Caches
IEEE Transactions on Parallel and Distributed Systems
Software Dependability in the Tandem GUARDIAN System
IEEE Transactions on Software Engineering
A User-level Checkpointing Library for POSIX Threads Programs
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Libckpt: transparent checkpointing under Unix
TCON'95 Proceedings of the USENIX 1995 Technical Conference Proceedings
Measurement-Based Analysis of System Dependability Using Fault Injection and Field Failure Data
Performance Evaluation of Complex Systems: Techniques and Tools, Performance 2002, Tutorial Lectures
IEEE Transactions on Software Engineering
A system model for dynamically reconfigurable software
IBM Systems Journal
Microreboot — A technique for cheap recovery
OSDI'04 Proceedings of the 6th conference on Symposium on Opearting Systems Design & Implementation - Volume 6
LEAN: simplifying concurrency bug reproduction via replay-supported execution reduction
Proceedings of the ACM international conference on Object oriented programming systems languages and applications
Hi-index | 0.00 |
In this paper, we introduce an efficient technique for checkpointing multithreaded applications. Our approach makes use of processes constructed around the ARMOR (Adaptive Reconfigurable Mobile Objects of Reliability) paradigm implemented in our Chameleon testbed. ARMOR processes are composed of disjoint elements (objects) with controlled manipulation of element state. These characteristics of ARMOR's allow the process state to be collected during runtime in an efficient manner and saved to disk when necessary. We call this approach micro-checkpointing.We demonstrate micro-checkpointing in the Chameleon testbed; an environment for developing reliable distributed applications. Our results show that the overhead ranges from between 39% to 141% with an aggressive checkpointing policy, depending upon the degree to which the process conforms to our ARMOR paradigm.