Fault propagation analysis based variable length checkpoint placement for fault-tolerant parallel and distributed systems

Authors:
Viral Shah;Sourav Bhattacharya
Affiliations:
-;-
Venue:
COMPSAC '97 Proceedings of the 21st International Computer Software and Applications Conference
Year:
1997

Abstract

The paper proposes optimal checkpoint placement strategies based on failure propagation analysis (FPA) in a distributed rollback recovery system. The authors' previously proposed FPA-based checkpoint placement strategy is enhanced by incorporating link failures, task grouping and allocation, and loop stabilization. Because empirical observation shows that a large number of faults occur around message communication instructions, the strategy places more checkpoints around the message send/receive regions of the code. Allocating tasks (or threads) to different processors can lead to varied communication patterns, which in turn can affect the FPA process and the checkpoint placement strategies. Another key contribution of the research is therefore to show the cyclic relationship between checkpointing and task allocation, as well as recursion in parallel or distributed programs. The proposed ideas and FPA approaches are illustrated using a typical parallel algorithm, the fast Fourier transform (FFT).
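
The idea of placing checkpoints more densely around message send/receive regions can be illustrated with a minimal sketch. This is not the authors' FPA algorithm: the function name place_checkpoints, the opcode strings, and the parameters near_gap, far_gap, and window are all hypothetical, chosen only to show how a variable-length checkpoint interval might shrink near communication instructions.

```python
def place_checkpoints(instructions, near_gap=2, far_gap=6, window=2):
    """Return the instruction indices after which a checkpoint is taken.

    Assumptions (illustrative, not from the paper):
      - `instructions` is a list of opcode strings; "send" and "recv"
        mark message communication instructions.
      - Within `window` instructions of a send/recv, checkpoints are
        taken every `near_gap` instructions; elsewhere every `far_gap`.
    """
    # Positions of all communication instructions in the program.
    comm = [i for i, op in enumerate(instructions) if op in ("send", "recv")]

    checkpoints = []
    since_last = 0  # instructions executed since the last checkpoint
    for i in range(len(instructions)):
        since_last += 1
        # Use the short interval when close to any send/recv.
        near = any(abs(i - c) <= window for c in comm)
        gap = near_gap if near else far_gap
        if since_last >= gap:
            checkpoints.append(i)
            since_last = 0
    return checkpoints


# Hypothetical program: 8 compute steps, one send, 8 more compute steps.
program = ["compute"] * 8 + ["send"] + ["compute"] * 8
print(place_checkpoints(program))  # -> [5, 7, 9, 15]
```

In this toy run the checkpoints at indices 7 and 9 bracket the send at index 8, while the purely computational regions are checkpointed only every six instructions. A real placement strategy would derive the interval lengths from the failure propagation analysis rather than from fixed constants.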