Lightweight Fault-tolerance for Highly Cooperative Distributed Applications

  • Authors:
  • Lorenzo Alvisi;Sriram Rao;Harrick M. Vin

  • Venue:
  • Lightweight Fault-tolerance for Highly Cooperative Distributed Applications
  • Year:
  • 1997

Abstract

The recent introduction of high-speed networks and faster processors, together with the rapid growth of heterogeneous large-scale distributed systems, has enabled the development of distributed applications that move beyond the client-server model to truly harness the computational potential of distributed systems. These new applications will be structured around groups of agents that communicate using messages as well as files. Some of these emerging applications will be critical enough to life or business to warrant explicit process replication to achieve high availability. Often, however, explicit replication will be too costly to implement, or high availability will simply not be necessary. In these circumstances, low-overhead fault-tolerance techniques will be crucial to achieving reliability. To address these needs, we are developing lightweight fault-tolerance (LFT), a new low-overhead approach to fault-tolerance for highly cooperative distributed applications. In the first part of this paper, we describe how LFT extends to file communication the causal logging techniques used in message passing. We show how, in our approach, all the synchronous operations that log-based protocols currently perform during file I/O are either eliminated or made asynchronous, thereby removing the opportunities for blocking. Furthermore, we argue that our approach has the potential to enhance the effectiveness of existing rollback recovery techniques for software fault-tolerance. In the second part of the paper, we validate LFT through extensive simulation. Our results indicate that LFT brings the cost of file communication down to the level of message passing, drastically reducing the overhead that fault-tolerant applications incur when performing file I/O.