High-Level Fault Tolerance in Distributed Programs

Authors:
Erik Seligman; Adam Beguelin
Venue:
High-Level Fault Tolerance in Distributed Programs
Year:
1994

Abstract

We have been developing high-level checkpoint and restart methods for Dome (Distributed Object Migration Environment), a C++ library of data-parallel objects that are automatically distributed using PVM. Fault tolerance mechanisms can be designed at several levels of programming abstraction: high-level, where checkpoint and restart are built into our C++ objects but the program structure is severely constrained; high-level with preprocessing, where a preprocessor inserts extra C++ statements into the code to facilitate checkpoint and restart; and low-level, where an interrupt periodically causes a memory image to be written out. Because we consider portability (both of our libraries and of the checkpoints they produce) to be an important goal, we focus on the higher-level checkpointing methods. In addition, we describe an implementation of high-level checkpointing, demonstrate it on multiple architectures, and show that it is efficient enough to provide good expected run times with low overhead, even in the case of frequent failures.
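The high-level approach described in the abstract can be illustrated with a minimal sketch. This is not the Dome API: `CheckpointableVector`, `run`, and all parameter names are hypothetical, and an in-memory string stands in for a checkpoint file. The sketch assumes the constrained program structure the abstract mentions, namely a single main loop whose state is fully captured by the loop index and the object's contents, and it writes that state as text so the checkpoint does not depend on the architecture that produced it.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a data-parallel object; not part of the Dome API.
class CheckpointableVector {
public:
    explicit CheckpointableVector(std::size_t n = 0, long init = 0)
        : data_(n, init) {}

    // Write the logical contents as text, so the checkpoint does not depend
    // on the byte order or word size of the machine that wrote it.
    void checkpoint(std::ostream& out) const {
        out << data_.size() << '\n';
        for (long v : data_) out << v << ' ';
        out << '\n';
    }

    // Rebuild the object from a checkpoint; returns false on malformed input
    // and leaves the current contents untouched in that case.
    bool restore(std::istream& in) {
        std::size_t n = 0;
        if (!(in >> n)) return false;
        std::vector<long> tmp(n);
        for (std::size_t i = 0; i < n; ++i)
            if (!(in >> tmp[i])) return false;
        data_.swap(tmp);
        return true;
    }

    void add_to_all(long delta) { for (long& v : data_) v += delta; }
    long sum() const { long s = 0; for (long v : data_) s += v; return s; }

private:
    std::vector<long> data_;
};

// Constrained main loop: the only state is the loop index and the object.
// `saved` plays the role of the checkpoint file (assumption for this sketch).
// `fail_at` simulates a crash at that iteration; pass a negative value for none.
// Returns true if all `total` iterations completed.
bool run(CheckpointableVector& v, int total, int interval, int fail_at,
         std::string& saved) {
    int start = 0;
    if (!saved.empty()) {  // restart: resume from the last checkpoint
        std::istringstream in(saved);
        if (!(in >> start) || !v.restore(in)) return false;
    }
    for (int i = start; i < total; ++i) {
        if (i == fail_at) return false;  // work since last checkpoint is lost
        v.add_to_all(1);                 // one step of the computation
        if ((i + 1) % interval == 0) {   // checkpoint at an iteration boundary
            std::ostringstream out;
            out << (i + 1) << '\n';
            v.checkpoint(out);
            saved = out.str();
        }
    }
    return true;
}
```

A caller would invoke `run` once, and after a failure construct a fresh object and call `run` again with the same `saved` string; only the iterations completed since the last checkpoint are repeated. Checkpointing only at iteration boundaries is what keeps the saved state small, at the cost of restricting where the program can be interrupted.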