System support for service availability, remote healing and fault tolerance using lazy state propagation

  • Authors:
  • Florin Sultan; Liviu Iftode

  • Affiliations:
  • Rutgers, The State University of New Jersey - New Brunswick; Rutgers, The State University of New Jersey - New Brunswick

  • Venue:
  • System support for service availability, remote healing and fault tolerance using lazy state propagation
  • Year:
  • 2004

Abstract

Our thesis is that lazy state propagation can be successfully used to implement efficient support for service availability, remote healing and fault tolerance.

The end-to-end availability of an Internet service is currently constrained by the static client-server binding imposed by the TCP/IP protocol. To overcome this problem, we propose lazy migration of live client service sessions between equivalent servers. We have designed and implemented Service Continuations, an OS mechanism for session state migration between multi-process servers, along with Migratory TCP, a connection migration protocol that enables lazy session migration. We present experimental results with real Internet servers that validate the approach.

An OS failure, or damage to OS state, can lead to loss of critical application and OS state residing in system memory. As a solution to this problem, we propose remote healing through lazy recovery and repair actions on the in-memory software state of a computer system. To enable remote healing, we have designed and implemented Backdoors, a novel system architecture based on remote memory communication that allows access to the resources of a machine even after an OS failure renders it unavailable. We present experimental results showing that Backdoors achieves efficient monitoring and fast recovery and repair.

Distributed shared memory (DSM) systems used to run parallel applications on large commodity clusters are sensitive to individual node failures, any one of which can compromise the whole computation. We have designed and implemented an efficient fault-tolerant DSM system, for which we have developed two lazy algorithms for garbage collection of recovery state. We demonstrate through experiments with benchmark applications that our recovery support is lightweight and that lazy garbage collection effectively limits the amount of recovery state retained in the system.