A reliability-aware approach for an optimal checkpoint/restart model in HPC environments

Authors:
Yudan Liu; Raja Nassar; Chockchai Leangsuksun; Nichamon Naksinehaboon; Mihaela Paun; Stephen Scott
Affiliations:
College of Engineering & Science, Louisiana Tech University, Ruston, LA 71270, USA; College of Engineering & Science, Louisiana Tech University, Ruston, LA 71270, USA; College of Engineering & Science, Louisiana Tech University, Ruston, LA 71270, USA; College of Engineering & Science, Louisiana Tech University, Ruston, LA 71270, USA; College of Engineering & Science, Louisiana Tech University, Ruston, LA 71270, USA; Computer Science and Mathematics Division, Oak Ridge National Laboratory, TN 37831, USA
Venue:
CLUSTER '07: Proceedings of the 2007 IEEE International Conference on Cluster Computing
Year:
2007

Abstract

The growing physical size of High Performance Computing (HPC) platforms makes system reliability more challenging. To minimize the performance loss caused by unexpected failures or by unnecessary overhead of fault-tolerance mechanisms, we present a reliability-aware method for an optimal checkpoint/restart strategy that minimizes rollback and checkpoint overheads. Our scheme addresses the fault tolerance challenge, especially in large-scale HPC systems, by providing optimal checkpoint placement techniques derived from the actual system reliability. Unlike existing checkpoint models, which can only handle Poisson failures and a constant checkpoint interval, our model supports a varying checkpoint interval and can deal with different failure distributions. In addition, the approach considers optimality for both checkpoint overhead and rollback time. Our validation results suggest a significant improvement over existing techniques.
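
The contrast between a constant checkpoint interval and a varying one can be illustrated with a minimal sketch. It is not the model derived in the paper, which the abstract does not specify. For the constant case it uses Young's well-known first-order approximation, interval = sqrt(2 * C * MTBF), which assumes exponentially distributed (Poisson) failures. For the varying case it assumes, purely for illustration, a Weibull failure distribution and replaces 1/MTBF with the instantaneous hazard rate h(t). The function names, parameters, and the hazard-based heuristic are all hypothetical.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Constant interval from Young's first-order approximation.

    Assumes exponentially distributed failures with mean `mtbf`.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def weibull_hazard(t, shape, scale):
    """Hazard rate h(t) of a Weibull(shape, scale) distribution."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def checkpoint_times(horizon, checkpoint_cost, shape, scale, start=1.0):
    """Illustrative heuristic (an assumption, not the paper's method).

    At each step the next interval is sqrt(2 * C / h(t)), i.e. Young's
    formula with 1/MTBF replaced by the current hazard rate. `start`
    should be positive when shape < 1, because the hazard is unbounded
    at t = 0 in that case.
    """
    times = []
    t = start
    while True:
        interval = math.sqrt(
            2.0 * checkpoint_cost / weibull_hazard(t, shape, scale)
        )
        t += interval
        if t > horizon:
            break
        times.append(t)
    return times


if __name__ == "__main__":
    # Hypothetical numbers: 5-minute checkpoint cost, 1000-minute scale.
    print("constant interval:", young_interval(5.0, 1000.0))
    print("varying schedule:", checkpoint_times(5000.0, 5.0, 0.7, 1000.0))
```

With shape = 1 the Weibull hazard is constant at 1/scale, so the heuristic reduces to Young's constant interval. With shape < 1 the hazard decreases over time, so successive checkpoints are spaced further apart.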