Transparent fault tolerance for parallel applications on networks of workstations

Authors:
Daniel J. Scales;Monica S. Lam
Affiliations:
Computer Systems Laboratory, Stanford University, Stanford, CA;Computer Systems Laboratory, Stanford University, Stanford, CA
Venue:
ATEC '96 Proceedings of the 1996 annual conference on USENIX Annual Technical Conference
Year:
1996

Abstract

This paper describes a new method for providing transparent fault tolerance for parallel applications on a network of workstations. We have designed our method in the context of a shared object system called SAM, a portable run-time system that provides a global name space and automatic caching of shared data. SAM incorporates a novel design intended to address the high communication overheads of distributed memory environments and is implemented on a variety of distributed memory platforms. Our fundamental approach to providing fault tolerance is to ensure that all data is replicated on more than one workstation, using the dynamic caching already provided by SAM. The replicated data is accessible to the local processor like other cached data, making access to shared data faster and potentially offsetting some of the fault-tolerance overhead. In addition, our method uses information available in SAM applications on how processes access shared data to enable several optimizations that reduce the fault-tolerance overhead. We have built an implementation of our fault-tolerance method in SAM for heterogeneous networks of workstations running PVM3. In this paper, we present our fault-tolerance method and describe its implementation in detail. We give performance results and overhead numbers for several large SAM applications running on a cluster of Alpha workstations connected by an ATM network. Our method is successful in providing transparent fault tolerance for parallel applications running on a network of workstations and is unique in requiring no global synchronizations and no disk operations to a reliable file server.
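
The core idea in the abstract, keeping every shared object cached on more than one workstation so that a single crash never destroys the last copy, can be illustrated with a toy in-memory model. The sketch below is not SAM's API or protocol, which the abstract does not specify: `Node`, `ReplicatedStore`, `publish`, `read`, `fail`, and the `min_copies` parameter are all hypothetical names chosen for illustration, and the model ignores coherence between copies, message passing, and the access-pattern optimizations the paper mentions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One workstation with a local cache of shared objects (hypothetical model)."""
    name: str
    cache: dict = field(default_factory=dict)
    alive: bool = True


class ReplicatedStore:
    """Toy store that tries to keep every object cached on min_copies live nodes."""

    def __init__(self, node_names, min_copies=2):
        if len(node_names) < min_copies:
            raise ValueError("need at least min_copies nodes")
        self.nodes = {n: Node(n) for n in node_names}
        self.min_copies = min_copies

    def _live(self):
        return [n for n in self.nodes.values() if n.alive]

    def holders(self, key):
        """Names of the live nodes that currently cache `key`."""
        return [n.name for n in self._live() if key in n.cache]

    def _ensure_replicas(self, key, value):
        # Push copies to further live nodes until enough of them hold the key.
        for node in self._live():
            if len(self.holders(key)) >= self.min_copies:
                break
            node.cache.setdefault(key, value)

    def publish(self, owner, key, value):
        """Store a value on its owner, then replicate it to other nodes."""
        self.nodes[owner].cache[key] = value
        self._ensure_replicas(key, value)

    def read(self, reader, key):
        """Serve from the local cache if present, else fetch and cache a copy."""
        node = self.nodes[reader]
        if key not in node.cache:
            source = self.holders(key)[0]
            node.cache[key] = self.nodes[source].cache[key]
        return node.cache[key]

    def fail(self, name):
        """Simulate a crash: drop the node's cache, then restore the copy count."""
        self.nodes[name].alive = False
        self.nodes[name].cache.clear()
        surviving_keys = {k for n in self._live() for k in n.cache}
        for key in surviving_keys:
            value = self.nodes[self.holders(key)[0]].cache[key]
            self._ensure_replicas(key, value)


store = ReplicatedStore(["a", "b", "c"])
store.publish("a", "x", 41)   # cached on "a" and replicated to one other node
store.fail("a")               # the surviving replica is copied again
value_after_crash = store.read("c", "x")
```

In this model the replicas double as ordinary cache entries, so a read on a node that already holds a copy is purely local, which mirrors the abstract's point that replication can offset part of its own cost. Recovery here uses only memory on surviving nodes, with no disk writes and no barrier across all nodes, but how the real system achieves this is described in the paper itself, not in this sketch.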