Supporting nondeterministic execution in fault-tolerant systems

Authors:
J. H. Slye; E. N. Elnozahy
Venue:
FTCS '96: Proceedings of the Twenty-Sixth Annual International Symposium on Fault-Tolerant Computing
Year:
1996

Abstract

We present a technique to track nondeterminism resulting from asynchronous events and multithreading in log-based rollback-recovery protocols. This technique relies on using a software counter to compute the number of instructions between nondeterministic events in normal operation. Should a failure occur, the instruction counts are used to force the replay of these events at the same execution points. The execution of the application thus can be replayed to recreate the pre-failure state, while accommodating uncontrolled nondeterminism during normal operation. Implementation on a DEC Alpha processor shows that this support has a low overhead, typically less than 6% increase in running time for the applications we studied.
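
The record-and-replay idea in the abstract can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' implementation: the paper instruments real programs on a DEC Alpha, whereas here a "program" is a loop of deterministic steps, the "software counter" is a Python integer, and asynchronous events are simulated with a seeded random source. All names (`step`, `handle_event`, `run_and_log`, `replay`) are hypothetical.

```python
import random

INITIAL_STATE = 1


def step(state, pc):
    """One deterministic 'instruction' of a toy program (hypothetical)."""
    return (state * 31 + pc) % 1_000_003


def handle_event(state, payload):
    """Asynchronous event handler; its effect depends on when it runs."""
    return state ^ payload


def run_and_log(n_instructions, seed):
    """Normal operation: events arrive at unpredictable points.

    For each event, log how many instructions ran since the previous
    event, together with the event's payload.
    """
    rng = random.Random(seed)  # stands in for uncontrolled nondeterminism
    state = INITIAL_STATE
    since_last = 0             # the software instruction counter
    log = []
    for pc in range(n_instructions):
        if rng.random() < 0.01:            # an event interrupts here
            payload = rng.randrange(1 << 16)
            log.append((since_last, payload))
            since_last = 0
            state = handle_event(state, payload)
        state = step(state, pc)
        since_last += 1
    return state, log


def replay(n_instructions, log):
    """Recovery: re-execute, forcing each logged event at the same
    instruction count at which it originally occurred."""
    state = INITIAL_STATE
    since_last = 0
    pending = iter(log)
    next_event = next(pending, None)
    for pc in range(n_instructions):
        if next_event is not None and next_event[0] == since_last:
            state = handle_event(state, next_event[1])
            since_last = 0
            next_event = next(pending, None)
        state = step(state, pc)
        since_last += 1
    return state


if __name__ == "__main__":
    final_state, event_log = run_and_log(10_000, seed=7)
    recovered = replay(10_000, event_log)
    print("states match:", recovered == final_state)
```

Replay needs no access to the random source: the logged instruction counts and payloads are enough to reproduce the pre-failure state. The sketch assumes at most one event per instruction boundary; a real system would also have to handle events that arrive back to back.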