The piecewise deterministic execution model is a fundamental assumption in many log-based rollback-recovery protocols. Process execution in this model consists of intervals, each starting with the receipt of a message at an application-defined execution point. Execution within each interval is deterministic, and messages are the only source of nondeterminism that affects the computation. This simple model excludes the nondeterminism that results when asynchronous signals or interrupts occur at arbitrary execution points. As a result, a wide range of applications cannot use log-based rollback-recovery in practice.
We present a solution that removes this restriction and allows applications to replay interrupts at the same execution points during recovery. The solution relies on a software counter to compute the number of instructions between asynchronous signals during normal operation. Should a failure occur, the instruction counts are used to force the replay of these signals at the same execution points. The execution of the application can thus be replayed to recreate the prefailure state while accommodating nondeterminism due to asynchronous signals. We then use the deterministic replay of interrupts to solve another problem, namely tracking nondeterminism due to interleaved shared-memory access in multithreaded applications on a single processor. We use the instruction counter solution to implement a user-level thread package in which thread scheduling decisions can be replayed if a failure occurs. By repeating the scheduling decisions during an execution replay, threads access the shared memory in the same order, which allows the execution to be reconstructed. This technique allows multithreaded applications to use log-based rollback-recovery with low overhead, which was not previously possible.
Two prototype implementations show that the overhead is no more than a 6 percent slowdown in application execution on the DEC Alpha, and between 6 and 18 percent on the Intel Pentium. Thus, the restrictions of the piecewise deterministic execution model can be lifted at a reasonable cost.
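The idea of logging instruction counts between asynchronous signals and using them to force replay at the same execution points can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: the "application" is a toy loop in which each iteration stands in for one instruction, the variable `since_last` plays the role of the software instruction counter, and the names `step`, `handler`, `run_normal`, and `run_replay` are hypothetical. A real system would count machine instructions and intercept actual signals.

```python
import random

MOD = 2**31

def step(state, pc):
    """One deterministic 'instruction' of a hypothetical application."""
    return (state * 1103515245 + pc + 12345) % MOD

def handler(state):
    """Hypothetical signal handler; the final state depends on where it runs."""
    return state ^ 0x5A5A5A5A

def run_normal(n_instructions, signal_prob, rng):
    """Normal operation: signals arrive at arbitrary points and are logged."""
    state, since_last, log = 1, 0, []
    for pc in range(n_instructions):
        if rng.random() < signal_prob:   # an asynchronous signal arrives here
            log.append(since_last)       # instructions executed since the last signal
            since_last = 0
            state = handler(state)
        state = step(state, pc)
        since_last += 1                  # stand-in for the software instruction counter
    return state, log

def run_replay(n_instructions, log):
    """Recovery: deliver each logged signal once the counter reaches its count."""
    state, since_last, next_entry = 1, 0, 0
    for pc in range(n_instructions):
        if next_entry < len(log) and since_last == log[next_entry]:
            next_entry += 1
            since_last = 0
            state = handler(state)       # same handler, same execution point
        state = step(state, pc)
        since_last += 1
    return state

# Usage: run once with random signal arrivals, then replay from the log alone.
final_state, signal_log = run_normal(10_000, 0.001, random.Random(7))
print(run_replay(10_000, signal_log) == final_state)
```

The replay run never consults the random source: the log of instruction counts is enough to reproduce the prefailure state, because everything between two signals is deterministic. The same mechanism could in principle drive a user-level thread scheduler by treating each preemption as a logged signal, although that extension is not shown here.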