Checkpointing and Recovery for Distributed Shared Memory Applications

Venue: IWOOOS '95 Proceedings of the 4th International Workshop on Object-Orientation in Operating Systems
Year: 1995

Abstract

The paper proposes an approach for adding fault tolerance, based on consistent checkpointing, to distributed shared memory applications. Two different mechanisms are presented to efficiently address the issue of message losses due to either site failures or unreliable non-FIFO channels. Both guarantee a correct and efficient recovery from a consistent distributed system state following a failure. A variant of the two-phase commit protocol is employed such that the communication overhead required to take a consistent checkpoint is the same as that of systems using a one-phase commit protocol, while our protocol utilises stable storage more efficiently. A consistent checkpoint is committed when the first phase of the protocol finishes.