A fault tolerance protocol for stateless parallel processing

  • Authors: Yuan Shi; Yijian Yang

  • Affiliations: Temple University; Temple University

  • Venue: A fault tolerance protocol for stateless parallel processing
  • Year: 2007

Abstract

Application fault tolerance is defined as a property that enables any high performance computing (HPC) application to continue operating in the event of single or multiple component failures in the HPC processing system. Note that a fault-tolerant HPC system typically does not support application fault tolerance, since it merely isolates the failed components without providing uninterrupted service to the running applications. Exception handlers in languages such as C++, Java and C# are a form of fault-tolerant programming. Unfortunately, for HPC applications, these tools are insufficient to provide application fault tolerance.

Stateless Parallel Processing (SPP) refers to a multi-processor architecture and a parallel programming paradigm that contain the minimal number of exposed application states. A dataflow machine is naturally SPP. A tuple-space based parallel processing system, such as Linda or Synergy, is also SPP.

An SPP application contains two types of components: stateless workers and stateful masters. Unlike message-passing and shared-memory parallel processing systems, an SPP system provides high performance and high availability at the same time via an enhanced inter-processor communication layer.

This dissertation focuses on the development of a low-overhead fault tolerance method for protecting both stateless and stateful components in an SPP application. For the stateless components, a shadow tuple mechanism with shadow state recovery is designed to incur very small overhead. For the stateful components, a system-level, non-blocking coordinated checkpoint/restart mechanism with socket support is devised. A prototype of the proposed fault tolerance protocol for SPP has been implemented for performance evaluation. Results of experiments performed with this prototype show that the proposed protocol not only has better performance and scalability than blocking protocols, but also has a provably small overhead.
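
The abstract does not spell out how shadow tuples work, so the following is only a minimal sketch of the general idea under stated assumptions: when a stateless worker takes a work tuple, the tuple space keeps a shadow copy until the worker commits a result, and the copy is made visible again if the worker fails. The class `ShadowTupleSpace` and its methods are hypothetical names chosen for illustration; they are not the Synergy or Linda API, and the dissertation's actual protocol may differ.

```python
class ShadowTupleSpace:
    """Toy in-memory tuple space (hypothetical; not the Synergy or Linda API)."""

    def __init__(self):
        self.work = {}     # work tuples visible to workers: name -> value
        self.shadow = {}   # tuples held by a worker: name -> (worker_id, value)
        self.results = {}  # committed results: name -> result

    def put(self, name, value):
        """Master publishes a work tuple."""
        self.work[name] = value

    def take(self, worker_id):
        """Hand one tuple to a worker, keeping a shadow copy until commit."""
        if not self.work:
            return None
        name, value = self.work.popitem()
        self.shadow[name] = (worker_id, value)
        return name, value

    def commit(self, name, result):
        """Worker finished: store the result and drop the shadow copy."""
        self.shadow.pop(name)
        self.results[name] = result

    def recover(self, failed_worker_id):
        """Assumed recovery step: re-expose tuples held by a failed worker."""
        lost = [n for n, (w, _) in self.shadow.items() if w == failed_worker_id]
        for n in lost:
            _, value = self.shadow.pop(n)
            self.work[n] = value
        return lost


space = ShadowTupleSpace()
space.put("task-1", 21)

# worker-A takes the tuple and then fails before committing a result.
space.take("worker-A")
recovered = space.recover("worker-A")

# worker-B picks up the restored tuple and completes the work.
name, value = space.take("worker-B")
space.commit(name, value * 2)
```

Because the worker holds no state beyond the tuple it took, restoring the shadow copy is enough to redo the lost work in this sketch; the stateful master would instead need the checkpoint/restart mechanism the abstract describes.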