We present a technique to track nondeterminism resulting from asynchronous events and multithreading in log-based rollback-recovery protocols. The technique uses a software counter to count the instructions executed between nondeterministic events during normal operation. Should a failure occur, the instruction counts are used to force the replay of these events at the same execution points. The application's execution can thus be replayed to recreate the pre-failure state, even though nondeterminism was left uncontrolled during normal operation. An implementation on a DEC Alpha processor shows that this support has low overhead, typically less than a 6% increase in running time for the applications we studied.
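The idea of logging instruction counts between nondeterministic events and using them to steer replay can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the implementation described above: the "program" is a loop of deterministic steps standing in for machine instructions, asynchronous events are simulated with a random number generator, and every name (`step`, `handle`, `run_and_log`, `replay`) is hypothetical.

```python
import random

MOD = 1_000_003  # arbitrary modulus for the toy state (assumption)

def step(state, pc):
    """One deterministic 'instruction' of a toy program."""
    return (state * 31 + pc) % MOD

def handle(state, payload):
    """Deterministic handler for an asynchronous event."""
    return (state ^ payload) % MOD

def run_and_log(n_instructions, rng):
    """Normal operation: events arrive at uncontrolled points.

    For each event, log how many instructions ran since the previous
    event, together with the event's payload.
    """
    state, log, since_last = 1, [], 0
    for pc in range(n_instructions):
        if rng.random() < 0.05:  # simulated asynchronous arrival
            payload = rng.randrange(1 << 16)
            log.append((since_last, payload))
            since_last = 0
            state = handle(state, payload)
        state = step(state, pc)
        since_last += 1
    return state, log

def replay(n_instructions, log):
    """Recovery: deliver each logged event after the same instruction count."""
    state, since_last = 1, 0
    pending = iter(log)
    nxt = next(pending, None)
    for pc in range(n_instructions):
        if nxt is not None and since_last == nxt[0]:
            state = handle(state, nxt[1])
            since_last = 0
            nxt = next(pending, None)
        state = step(state, pc)
        since_last += 1
    return state

# Usage: the replayed run reaches the same state as the original run.
final_state, event_log = run_and_log(10_000, random.Random(7))
recovered_state = replay(10_000, event_log)
print(recovered_state == final_state)
```

In this sketch the replay never consults the random number generator. The logged counts alone determine where each event is delivered, which is why the recovered state matches the pre-failure state.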