Fault propagation analysis based variable length checkpoint placement for fault-tolerant parallel and distributed systems

  • Authors:
  • Viral Shah; Sourav Bhattacharya

  • Venue:
  • COMPSAC '97 Proceedings of the 21st International Computer Software and Applications Conference
  • Year:
  • 1997

Abstract

The paper proposes optimal checkpoint placement strategies using failure propagation analysis (FPA) in a distributed rollback recovery system. The authors' previously proposed FPA-based checkpoint placement strategy is enhanced by incorporating link failures, task grouping/allocation, and loop stabilization. Because a large number of faults are empirically observed to occur around message communication instructions, the strategy places more checkpoints around the message send/receive regions of the code. Allocating tasks (or threads) to different processors can lead to varied communication patterns, which in turn can affect the FPA process and the checkpoint placement strategies. Thus, another key contribution of this work is to show the cyclic relationship between checkpointing and task allocation, as well as recursion in parallel or distributed programs. The proposed ideas and FPA approaches are illustrated using a typical parallel algorithm, the fast Fourier transform (FFT).
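
The idea of variable-length checkpoint intervals that tighten around communication instructions can be illustrated with a small sketch. This is not the authors' FPA algorithm, which the abstract does not specify. It is a minimal illustration under assumed names and parameters: the function `place_checkpoints`, the operation labels `"send"`, `"recv"`, and `"compute"`, and the values of `base_interval`, `comm_interval`, and `window` are all hypothetical.

```python
def place_checkpoints(instructions, base_interval=8, comm_interval=2, window=2):
    """Return the instruction indices after which a checkpoint is taken.

    Hypothetical sketch: checkpoints are spaced `base_interval` instructions
    apart in pure computation, and `comm_interval` apart within `window`
    instructions of any send/receive.
    """
    n = len(instructions)

    # Mark every instruction that lies close to a message send or receive.
    near_comm = set()
    for i, op in enumerate(instructions):
        if op in ("send", "recv"):
            near_comm.update(range(max(0, i - window), min(n, i + window + 1)))

    checkpoints = []
    since_last = 0  # instructions executed since the previous checkpoint
    for i in range(n):
        since_last += 1
        limit = comm_interval if i in near_comm else base_interval
        if since_last >= limit:
            checkpoints.append(i)
            since_last = 0
    return checkpoints


# Ten compute steps, one send, ten more compute steps.
program = ["compute"] * 10 + ["send"] + ["compute"] * 10
print(place_checkpoints(program))  # [7, 9, 11, 19]
```

In this toy run the checkpoints at indices 9 and 11 bracket the send at index 10, while the purely computational stretches keep the longer interval. An FPA-based scheme would presumably derive the intervals from how far a fault can propagate rather than from fixed constants.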