Support for Software Interrupts in Log-Based Rollback-Recovery

Authors:
J. Hamilton Slye; E. N. Elnozahy
Affiliations:
Bell Labs, Murray Hill, NJ; IBM Research Lab, Austin, TX
Venue:
IEEE Transactions on Computers
Year:
1998

Abstract

The piecewise deterministic execution model is a fundamental assumption in many log-based rollback-recovery protocols. Process execution in this model consists of intervals, each starting with the receipt of a message at an application-defined execution point. Execution within each interval is deterministic, and messages are the only source of nondeterminism that affects the computation. This simple model excludes the nondeterminism that results when asynchronous signals or interrupts occur at arbitrary execution points. As a result, a wide range of applications cannot use log-based rollback-recovery in practice. We present a solution that removes this restriction and allows applications to replay interrupts at the same execution points during recovery. The solution relies on a software counter to compute the number of instructions between asynchronous signals during normal operation. Should a failure occur, the instruction counts are used to force the replay of these signals at the same execution points. The execution of the application can thus be replayed to recreate the prefailure state while accommodating nondeterminism due to asynchronous signals. We then use the deterministic replay of interrupts to solve another problem, namely tracking nondeterminism due to interleaved shared memory access in multithreaded applications on a single processor. We use the instruction counter solution to implement a user-level thread package in which thread scheduling decisions can be replayed if a failure occurs. By repeating the scheduling decisions during an execution replay, threads access the shared memory in the same order and the execution can be reconstructed. This technique allows multithreaded applications to use log-based rollback-recovery with low overhead, which was not previously possible.
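The idea of logging instruction counts and replaying signals at the same execution points can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the loop index stands in for the software instruction counter, and `instruction`, `handler`, `run`, and the modulus `M` are hypothetical names invented for the example.

```python
import random

M = 1_000_003  # hypothetical prime modulus; keeps the toy state small


def instruction(state, count):
    # One deterministic "instruction" of the application.
    return (31 * state + count) % M


def handler(state, value):
    # Signal handler; its effect on the final state depends on when it runs.
    return (state + value) % M


def run(steps, rng=None, replay_log=None):
    """Run `steps` instructions.

    Normal operation (replay_log is None): signals arrive at arbitrary
    points and each is logged as (instruction count, value).
    Recovery (replay_log given): each signal is delivered again at the
    logged instruction count.
    """
    state, log = 0, []
    pending = list(replay_log) if replay_log is not None else []
    for count in range(steps):
        if replay_log is None:
            if rng is not None and rng.random() < 0.05:
                value = rng.randrange(1, 100)
                log.append((count, value))
                state = handler(state, value)
        else:
            while pending and pending[0][0] == count:
                state = handler(state, pending.pop(0)[1])
        state = instruction(state, count)
    return state, log


final, log = run(1000, rng=random.Random(7))
replayed, _ = run(1000, replay_log=log)
assert replayed == final  # replay recreates the prefailure state
```

In this model, delivering a logged signal even one instruction late produces a different final state, which is why the count has to be exact rather than approximate.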
Two prototype implementations show that the overhead is no more than a 6 percent slowdown in application execution on the DEC Alpha, and from 6 percent to 18 percent on the Intel Pentium. Thus, restrictions of the piecewise deterministic execution model can be lifted at a reasonable cost.
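The replay of thread scheduling decisions described in the abstract can be sketched in the same spirit. The example below is an illustration under assumptions, not the thread package from the paper: Python generators stand in for user-level threads, each `yield` counts as one instruction, and `worker`, `schedule`, and the 0.3 preemption probability are hypothetical. Exactly one of `rng` (normal operation) or `switch_log` (recovery) should be passed.

```python
import random


def worker(shared, iterations):
    # Non-atomic increment: a preemption between read and write loses updates.
    for _ in range(iterations):
        tmp = shared["x"]       # read shared memory
        yield                   # a preemption may land here
        shared["x"] = tmp + 1   # write shared memory
        yield


def schedule(n_threads, iterations, rng=None, switch_log=None):
    """Run the threads on one simulated processor.

    Normal operation: preempt at arbitrary points and record the
    instruction count of every context switch.
    Recovery: preempt exactly at the recorded instruction counts.
    """
    shared = {"x": 0}
    threads = [worker(shared, iterations) for _ in range(n_threads)]
    recorded = []
    replay = set(switch_log) if switch_log is not None else None
    current, count = 0, 0
    while threads:
        try:
            next(threads[current])
        except StopIteration:
            # Thread finished; the next one moves into its slot.
            threads.pop(current)
            if threads:
                current %= len(threads)
            continue
        count += 1
        if replay is None:
            preempt = rng.random() < 0.3
            if preempt:
                recorded.append(count)
        else:
            preempt = count in replay
        if preempt:
            current = (current + 1) % len(threads)
    return shared["x"], recorded


total, switches = schedule(3, 50, rng=random.Random(3))
replayed_total, _ = schedule(3, 50, switch_log=switches)
assert replayed_total == total  # same interleaving, same shared-memory result
```

Only the switch points are logged, not the individual memory accesses. On a single processor, repeating the scheduling decisions is enough to reproduce the order of shared-memory accesses.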