Introduction to parallel algorithms and architectures: array, trees, hypercubes
An Algorithm for Subgraph Isomorphism
Journal of the ACM (JACM)
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
Fault tolerance is important for a distributed system to increase its reliability and throughput. Checkpoint and recovery protocols have been proposed as a fault tolerance mechanism for non-critical applications. The performance of these protocols plays an important role in the overall performance of a distributed system, and it depends on the characteristics of both the system and the application. In this paper, we propose a novel technique to automatically identify the checkpoint and recovery protocol that is likely to perform best for a given system and the application it is currently running. We present experimental results showing that the scheme can efficiently determine a suitable checkpoint and recovery protocol for many applications.