Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations

Authors:
Jinsong Ouyang;Piyush Maheshwari
Affiliations:
Performance Technology Center, Hewlett-Packard Company, Roseville, CA 95747, jinsong_ouyang@hp.com;School of Computer Science & Engineering, University of New South Wales, Sydney 2052, Australia, piyush@cse.unsw.edu.au
Venue:
The Journal of Supercomputing
Year:
1999

Abstract

In this paper we present an approach to reliable distributed computing that incorporates fault tolerance into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. In our model, fault tolerance is based on distributed consistent checkpointing and rollback-recovery, integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance of message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented to support fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface, including message-passing and file I/O primitives, that hides from the application level the complexity of both fault-tolerant network communication and the checkpointing and recovery of user files. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.
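
The abstract does not describe how user files are checkpointed, so the following is only a minimal sketch of the general idea of rolling persistent state back to a checkpoint. It assumes an append-only output file, for which recording the file length at checkpoint time and truncating to that length on rollback is sufficient. The class `CheckpointedFile` and the function `demo` are hypothetical names introduced for illustration; they are not Libra's interface or algorithm.

```python
import os
import tempfile


class CheckpointedFile:
    """Append-only file that can be rolled back to its last checkpoint.

    Hypothetical illustration only; not Libra's interface or algorithm.
    """

    def __init__(self, path):
        self.path = path
        # Binary append mode: creates the file if missing, writes go to the end.
        self._f = open(path, "ab")
        # Treat the state at open time as the initial checkpoint.
        self._ckpt_len = os.path.getsize(path)

    def write(self, data):
        self._f.write(data)

    def checkpoint(self):
        # Push buffered data to stable storage, then record the file length.
        self._f.flush()
        os.fsync(self._f.fileno())
        self._ckpt_len = os.path.getsize(self.path)
        return self._ckpt_len

    def rollback(self):
        # Discard everything appended since the last checkpoint.
        self._f.flush()
        self._f.truncate(self._ckpt_len)

    def close(self):
        self._f.close()


def demo():
    """Write, checkpoint, simulate a failure, roll back, and redo the work."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "out.log")
        f = CheckpointedFile(path)
        f.write(b"step 1\n")
        f.checkpoint()
        f.write(b"step 2 (lost on failure)\n")
        f.rollback()  # recovery: the file returns to its checkpointed contents
        f.write(b"step 2 (redone)\n")
        f.close()
        with open(path, "rb") as r:
            return r.read()
```

Calling `demo()` returns the file contents after recovery, with the write made after the checkpoint discarded and then redone. Files that are overwritten in place would need more than a recorded length (for example, saved copies of the modified regions), which this sketch does not attempt.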