Optimistic Recovery in Multi-Threaded Distributed Systems

Authors:
Om P. Damani; Ashis Tarafdar; Vijay K. Garg
Venue:
SRDS '99 Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems
Year:
1999

Abstract

The problem of recovering distributed systems from crash failures has been widely studied in the context of traditional non-threaded processes. However, extending those solutions to the multi-threaded scenario presents new problems. We identify and address these problems for optimistic logging protocols. There are two natural extensions of optimistic logging protocols to the multi-threaded scenario. The first extension is "process-centric", where the points of internal non-determinism caused by threads are logged. The second extension is "thread-centric", where each thread is treated as a separate process. The process-centric approach suffers from false causality, while the thread-centric approach suffers from high causality-tracking overhead. By observing that the granularity of failures can be different from the granularity of rollbacks, we design a new "balanced" approach that incurs low causality-tracking overhead and also eliminates false causality.
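
The trade-off described in the abstract can be illustrated with a toy model. The sketch below is not the paper's protocol. It assumes that "balanced" means each thread keeps its own dependency set (rollback granularity) whose entries name processes (failure granularity); the abstract does not spell out the mechanism, so this reading is an assumption. State intervals, incarnations, logging, and intra-process interaction through shared memory are all ignored, and every name in the code (SCHEMES, threads_to_roll_back, entries_per_vector) is hypothetical.

```python
# Toy model: two processes with two threads each. A thread is a
# (process, index) tuple. All names here are hypothetical.
PROCESSES = ["P", "Q"]
THREADS_PER_PROCESS = 2


def process_of(thread):
    return thread[0]


def identity(thread):
    return thread


# Each scheme is a pair:
#   (unit that owns a dependency set, unit that a dependency entry names)
SCHEMES = {
    "process-centric": (process_of, process_of),
    "thread-centric": (identity, identity),
    "balanced": (identity, process_of),  # assumed reading of the abstract
}


def entries_per_vector(scheme):
    """Number of units a dependency vector must be able to name."""
    _, names = SCHEMES[scheme]
    per_process = THREADS_PER_PROCESS if names is identity else 1
    return len(PROCESSES) * per_process


def threads_to_roll_back(scheme, messages, failed_process):
    """Surviving threads that appear to depend on the failed process.

    Simplification: the failed process is assumed to have lost every
    state from which it sent a message. Its own threads are excluded
    because they restart anyway.
    """
    owner, names = SCHEMES[scheme]
    deps = {}
    for sender, receiver in messages:
        # The receiver inherits the sender's dependencies plus the sender.
        inherited = deps.get(owner(sender), set())
        deps.setdefault(owner(receiver), set()).update(inherited | {names(sender)})

    def names_failed(entry):
        entry_process = entry if isinstance(entry, str) else entry[0]
        return entry_process == failed_process

    all_threads = [
        (p, i) for p in PROCESSES for i in range(1, THREADS_PER_PROCESS + 1)
    ]
    return sorted(
        t
        for t in all_threads
        if process_of(t) != failed_process
        and any(names_failed(e) for e in deps.get(owner(t), set()))
    )


# Thread 1 of Q sends one message to thread 1 of P, then Q crashes.
messages = [(("Q", 1), ("P", 1))]
for scheme in SCHEMES:
    print(scheme, entries_per_vector(scheme), threads_to_roll_back(scheme, messages, "Q"))
```

In this scenario the process-centric scheme rolls back both threads of P even though thread 2 never communicated with Q, which is the false causality the abstract mentions. The thread-centric scheme rolls back only thread 1 of P but needs one vector entry per thread in the system. Under the stated assumption, the balanced scheme also rolls back only thread 1 of P while needing one entry per process.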