Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations

  • Authors:
  • Jinsong Ouyang;Piyush Maheshwari

  • Affiliations:
  • Performance Technology Center, Hewlett-Packard Company, Roseville, CA 95747, jinsong_ouyang@hp.com;School of Computer Science & Engineering, University of New South Wales, Sydney 2052, Australia, piyush@cse.unsw.edu.au

  • Venue:
  • The Journal of Supercomputing
  • Year:
  • 1999

Abstract

In this paper we present an approach to reliable distributed computing, which incorporates fault tolerance into applications at low cost, in terms of both run-time performance and programming effort required to construct reliable application software. In our model, fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented for supporting fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface including message-passing and file I/O primitives, which hides the complexity of both fault-tolerant network communications and checkpointing and recovering user files from the application level. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.