A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing

Authors:
Zizhong Chen; Jack Dongarra
Venue:
HASE '08 Proceedings of the 2008 11th IEEE High Assurance Systems Engineering Symposium
Year:
2008

Citing 0
Cited 3

Distributed Diskless Checkpoint for Large Scale Systems
CCGRID '10 Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing

FTI: high performance fault tolerance interface for hybrid systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Scalable Reed-Solomon-based reliable local storage for HPC applications on IaaS clouds
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing

Abstract

Diskless checkpointing is an efficient technique for saving the state of a long-running application in a distributed environment without relying on stable storage. In this paper, we introduce several scalable encoding strategies into diskless checkpointing and reduce the overhead to survive $k$ failures in $p$ processes from $2 \lceil \log p \rceil \cdot k ((\beta + 2\gamma) m + \alpha)$ to $\left(1 + O\left(\frac{1}{\sqrt{m}}\right)\right) \cdot k (\beta + 2\gamma) m$, where $\alpha$ is the communication latency, $\frac{1}{\beta}$ is the network bandwidth between processes, $\frac{1}{\gamma}$ is the rate at which calculations are performed, and $m$ is the size of the local checkpoint per process. The introduced algorithm is scalable in the sense that the overhead to survive $k$ failures in $p$ processes does not increase as the number of processes $p$ increases. We evaluate the performance overhead of the introduced algorithm using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that the introduced techniques are highly scalable.
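
To make the two overhead expressions in the abstract concrete, the following is a minimal sketch that evaluates both cost models for a growing process count. It is an illustration only, not the authors' code: the function names are hypothetical, the logarithm is assumed to be base 2, the $O(1/\sqrt{m})$ correction is omitted because the abstract does not give its constant, and the parameter values in the demo are made up.

```python
import math


def tree_encoding_overhead(p, k, m, alpha, beta, gamma):
    """Baseline model from the abstract:
    2 * ceil(log p) * k * ((beta + 2*gamma) * m + alpha).
    Assumption: the logarithm is base 2.
    """
    return 2 * math.ceil(math.log2(p)) * k * ((beta + 2 * gamma) * m + alpha)


def scalable_encoding_leading_term(k, m, beta, gamma):
    """Leading term of the improved model: k * (beta + 2*gamma) * m.
    The (1 + O(1/sqrt(m))) factor is dropped because its constant is
    not stated in the abstract, so this slightly underestimates the cost.
    Note that p does not appear in this expression.
    """
    return k * (beta + 2 * gamma) * m


if __name__ == "__main__":
    # Hypothetical machine parameters, chosen only for illustration.
    alpha = 1e-5   # communication latency (seconds)
    beta = 1e-9    # inverse network bandwidth (seconds per byte)
    gamma = 1e-10  # inverse calculation rate (seconds per byte)
    k = 1          # number of failures to survive
    m = 10**8      # local checkpoint size per process (bytes)

    for p in (64, 1024, 16384):
        base = tree_encoding_overhead(p, k, m, alpha, beta, gamma)
        new = scalable_encoding_leading_term(k, m, beta, gamma)
        print(f"p={p:6d}  baseline={base:.4f}s  scalable~{new:.4f}s")
```

Under these assumptions the baseline grows with $\lceil \log p \rceil$, while the leading term of the improved model is independent of $p$, which is the sense of "scalable" used in the abstract.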