The paper proposes optimal checkpoint placement strategies based on failure propagation analysis (FPA) in a distributed rollback recovery system. The authors' previously proposed FPA-based checkpoint placement strategy is enhanced by incorporating link failures, task grouping and allocation, and loop stabilization. Because a large number of faults are empirically observed to occur around message communication instructions, the strategy places more checkpoints around the message send/receive regions of the code. Allocating tasks (or threads) to different processors can produce different communication patterns, which in turn affect the FPA process and the resulting checkpoint placement. Thus, another key contribution of this work is to show the cyclic relationship between checkpointing and task allocation, as well as recursion in parallel or distributed programs. The proposed ideas and FPA approaches are illustrated using a typical parallel algorithm, the fast Fourier transform (FFT).
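The idea of placing checkpoints more densely around message send/receive regions can be sketched as follows. This is only a minimal illustration, not the authors' FPA-based algorithm: the operation labels (`"send"`, `"recv"`, `"compute"`), the function name `place_checkpoints`, and the `base_interval` parameter are all hypothetical and chosen here for clarity.

```python
from typing import List

# Hypothetical labels for message communication instructions.
COMM_OPS = {"send", "recv"}


def place_checkpoints(ops: List[str], base_interval: int = 4) -> List[int]:
    """Return the indices of instructions before which a checkpoint is taken.

    A checkpoint is taken before every communication instruction, and
    otherwise only after `base_interval` instructions have run since the
    last checkpoint. Both rules are assumptions made for illustration.
    """
    checkpoints = []
    since_last = 0  # instructions executed since the last checkpoint
    for i, op in enumerate(ops):
        if op in COMM_OPS or since_last >= base_interval:
            checkpoints.append(i)
            since_last = 0
        since_last += 1
    return checkpoints


# A toy instruction trace for one task (hypothetical).
trace = ["compute", "compute", "send", "compute", "recv",
         "compute", "compute", "compute", "compute", "compute"]
print(place_checkpoints(trace, base_interval=4))  # prints [2, 4, 8]
```

In this toy trace, checkpoints cluster at the `send` and `recv` instructions (indices 2 and 4), while the purely computational tail receives a single checkpoint once the base interval has elapsed. A real placement strategy would derive the density from the failure propagation analysis rather than from a fixed interval.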