Trading off logging overhead and coordinating overhead to achieve efficient rollback recovery

Authors:
Jin-Min Yang; Kin Fun Li; Wen-Wei Li; Da-Fang Zhang
Affiliations:
Software School, Hunan University, Changsha 410082, China; Electrical and Computer Engineering, University of Victoria, Victoria, BC, Canada V8W 3P6; Software School, Hunan University, Changsha 410082, China; Software School, Hunan University, Changsha 410082, China
Venue:
Concurrency and Computation: Practice & Experience
Year:
2009

Abstract

In the rollback recovery of large-scale, long-running applications in a distributed environment, pessimistic message logging protocols enable failed processes to recover independently, though at the expense of logging every message synchronously during fault-free execution. In contrast, coordinated checkpointing protocols avoid message logging, but they scale poorly because the coordinating overhead increases sharply as the system grows. With the aim of achieving efficient rollback recovery by trading off logging overhead and coordinating overhead, this paper suggests partitioning the system into clusters and then presents a scheme for converting between these two overheads. Using the proposed conversion, coordination can be introduced to reduce the unbearable logging overhead found in some systems, whereas proper logging can be employed to alleviate the unacceptable coordinating overhead in others. Furthermore, heuristics are introduced for partitioning the system into clusters so as to speed up the recovery process and improve recovery efficiency. Performance evaluation results indicate that the scheme can effectively lower the overall system overhead. Copyright © 2008 John Wiley & Sons, Ltd.
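
The trade-off described in the abstract can be illustrated with a toy cost model. The sketch below is not the paper's scheme. It assumes, for illustration only, that messages crossing a cluster boundary are logged while processes inside a cluster checkpoint in a coordinated fashion, and it uses the size of the largest cluster as a crude proxy for coordinating overhead. The function name `overheads`, the `traffic` table, and both cost measures are hypothetical.

```python
def overheads(traffic, clusters):
    """Return (logged_messages, coordination_proxy) for one partition.

    traffic  : dict mapping (sender, receiver) -> number of messages
    clusters : list of disjoint sets of process ids covering all processes

    Assumptions (not taken from the paper): only inter-cluster messages
    are logged, and coordination cost is approximated by the size of the
    largest cluster.
    """
    # Map each process id to the index of the cluster that contains it.
    cluster_of = {p: i for i, c in enumerate(clusters) for p in c}

    # Count messages whose sender and receiver sit in different clusters.
    logged = sum(
        count
        for (sender, receiver), count in traffic.items()
        if cluster_of[sender] != cluster_of[receiver]
    )

    # Crude proxy: the largest group that must checkpoint together.
    coordination = max(len(c) for c in clusters)
    return logged, coordination


# Hypothetical traffic: two chatty pairs (0,1) and (2,3), one weak link 1 -> 2.
traffic = {(0, 1): 100, (1, 0): 100, (2, 3): 100, (3, 2): 100, (1, 2): 5}

pure_logging = overheads(traffic, [{0}, {1}, {2}, {3}])   # (405, 1)
pure_coordination = overheads(traffic, [{0, 1, 2, 3}])    # (0, 4)
two_clusters = overheads(traffic, [{0, 1}, {2, 3}])       # (5, 2)
```

Under these assumptions, singleton clusters correspond to logging every message, a single cluster corresponds to fully coordinated checkpointing, and grouping heavily communicating processes together sits between the two extremes.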