Finding a suitable checkpoint and recovery protocol for a distributed application
Journal of Parallel and Distributed Computing - Special issue: 18th International parallel and distributed processing symposium
ICPADS '06 Proceedings of the 12th International Conference on Parallel and Distributed Systems - Volume 1
HADAB: enabling fault tolerance in parallel applications running in distributed environments
PPAM'11 Proceedings of the 9th international conference on Parallel Processing and Applied Mathematics - Volume Part I
Checkpointing is an effective technique for tolerating failures in distributed and parallel applications. The algorithms in the literature fall into two main classes: coordinated and independent checkpointing. This paper presents an experimental study that compares the performance of these two classes. The main conclusion of our study is that coordinated checkpointing is more efficient than independent checkpointing, and that the arguments raised against the performance of coordinated algorithms were not borne out in practice.
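
As a rough illustration of the two classes named in the abstract, the toy sketch below contrasts them. It is not the paper's experimental setup: the `Process` class, its single `step` counter, and both function names are hypothetical, and the sketch ignores in-flight messages, which real protocols must handle.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process whose entire state is one step counter."""
    name: str
    step: int = 0
    checkpoints: list = field(default_factory=list)  # saved step values

    def work(self, n=1):
        # Advance the local computation by n steps.
        self.step += n

    def save(self):
        # Record the current state as a local checkpoint.
        self.checkpoints.append(self.step)

    def restore(self):
        # Roll back to the most recent local checkpoint.
        self.step = self.checkpoints[-1]


def coordinated_checkpoint(processes):
    """All processes save at the same point, giving one global snapshot."""
    for p in processes:
        p.save()


def independent_checkpoint(process):
    """A single process saves on its own schedule, without synchronization."""
    process.save()


if __name__ == "__main__":
    procs = [Process("a"), Process("b")]
    for p in procs:
        p.work(3)
    coordinated_checkpoint(procs)   # both processes save the same step
    for p in procs:
        p.work(2)
    for p in procs:
        p.restore()                 # simulated failure and recovery
    print([(p.name, p.step) for p in procs])
```

In this toy model, recovery after a coordinated checkpoint returns every process to the same saved point, whereas independently taken checkpoints may record different points per process, so a recovery procedure has to reconcile them.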