Optimistic recovery in distributed systems
ACM Transactions on Computer Systems (TOCS)
Recovery in distributed systems using asynchronous message logging and checkpointing
PODC '88 Proceedings of the seventh annual ACM Symposium on Principles of distributed computing
Recoverable Distributed Shared Virtual Memory
IEEE Transactions on Computers
PVM: a framework for parallel distributed computing
Concurrency: Practice and Experience
Transparent fault-tolerance in parallel Orca programs
SEDMS III Papers from the symposium on Experiences with distributed and multiprocessor systems
Heterogeneous parallel programming in Jade
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
High-speed switch scheduling for local-area networks
ACM Transactions on Computer Systems (TOCS)
Adding fault-tolerant transaction processing to LINDA
Software—Practice & Experience
A checkpoint protocol for an entry consistent shared memory system
PODC '94 Proceedings of the thirteenth annual ACM symposium on Principles of distributed computing
Distributed snapshots: determining global states of distributed systems
ACM Transactions on Computer Systems (TOCS)
A message system supporting fault tolerance
SOSP '83 Proceedings of the ninth ACM symposium on Operating systems principles
An Efficient Shared Memory Layer for Distributed Memory Machines.
An Efficient Shared Memory Layer for Distributed Memory Machines.
IEEE Transactions on Parallel and Distributed Systems
Cluster I/O with River: making the fast case common
Proceedings of the sixth workshop on I/O in parallel and distributed systems
ATLAS: an infrastructure for global computing
EW 7 Proceedings of the 7th workshop on ACM SIGOPS European workshop: Systems support for worldwide applications
Transparent Fault Tolerance for Web Services Based Architectures
Euro-Par '02 Proceedings of the 8th International Euro-Par Conference on Parallel Processing
Adaptive and reliable parallel computing on networks of workstations
ATEC '97 Proceedings of the annual conference on USENIX Annual Technical Conference
Using early phase termination to eliminate load imbalances at barrier synchronization points
Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications
Performance analysis of mobile agent failure recovery in e-service applications
Computer Standards & Interfaces
Hi-index | 0.00 |
This paper describes a new method for providing transparent fault tolerance for parallel applications on a network of workstations. We have designed our method in the context of shared object system called SAM, a portable run-time system which provides a global name space and automatic caching of shared data. SAM incorporates a novel design intended to address the problem of the high communication overheads in distributed memory environments and is implemented on a variety of distributed memory platforms. Our fundamental approach to providing fault tolerance is to ensure the replication of all data on more than one workstation using the dynamic caching already providedby SAM. The replicated data is accessible to the local processor like other cached data, making access to shared data faster and potentially offsetting some of the fault tolerance overhead. In addition, our method uses information available in SAM applications on how processes access shared data to enable several optimizations which reduce the fault-tolerance overhead. We have built an implementation of our fault-tolerance method in SAM for heterogeneous networks of workstations running PVM3. In this paper, we present our fault-tolerance method and describe its implementation in detail. We give performance results and overhead numbers for several large SAM applications running on a cluster of Alpha workstations connected by an ATM network. Our method is successful in providing transparent fault tolerance for parallel applications running on a network of workstations and is unique in requiring no global synchronizations and no disk operations to a reliable file server.