Checkpointing vs. Migration for Post-Petascale Supercomputers

Authors:
Franck Cappello;Henri Casanova;Yves Robert
Venue:
ICPP '10 Proceedings of the 2010 39th International Conference on Parallel Processing
Year:
2010

Abstract

An alternative to classical fault-tolerant approaches for large-scale clusters is failure avoidance, by which the occurrence of a fault is predicted and a preventive measure is taken. We develop analytical performance models for two types of preventive measures: preventive checkpointing and preventive migration. We also develop an analytical model of the performance of a standard periodic checkpoint fault-tolerant approach. We instantiate these models for platform scenarios representative of current and future technology trends. We find that preventive migration is the better approach in the short term by orders of magnitude. However, in the longer term, both approaches have comparable merit with a marginal advantage for preventive checkpointing. We also find that standard non-prediction-based fault tolerance achieves poor scaling when compared to prediction-based failure avoidance, thereby demonstrating the importance of failure prediction capabilities. Finally, our results show that achieving good utilization in truly large-scale machines (e.g., 2^{20} nodes) for parallel workloads will require more than the failure avoidance techniques evaluated in this work.
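To give a rough feel for why periodic checkpointing alone scales poorly as node counts approach 2^{20}, the sketch below uses Young's classical first-order approximation for the checkpoint period. This is not the analytical model developed in the paper, which the abstract does not spell out. It assumes independent, exponentially distributed node failures, so that the platform MTBF is the node MTBF divided by the node count. The function names, the 10-year node MTBF, and the 600-second checkpoint cost are hypothetical values chosen for illustration only.

```python
import math


def platform_mtbf(node_mtbf_s: float, num_nodes: int) -> float:
    """Platform MTBF, assuming independent exponential node failures."""
    return node_mtbf_s / num_nodes


def young_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint period."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate fraction of time lost to checkpoints and re-execution.

    First-order estimate: C / T for checkpoint overhead plus T / (2 * MTBF)
    for work lost per failure. It is only meaningful when C is much smaller
    than the MTBF, so the result is capped at 1.0.
    """
    period = young_period(checkpoint_cost_s, mtbf_s)
    waste = checkpoint_cost_s / period + period / (2.0 * mtbf_s)
    return min(1.0, waste)


if __name__ == "__main__":
    # Hypothetical parameters, not taken from the paper.
    node_mtbf_s = 10 * 365 * 24 * 3600.0  # assumed 10-year MTBF per node
    checkpoint_cost_s = 600.0             # assumed 10-minute checkpoint

    for exponent in (10, 15, 20):
        nodes = 2 ** exponent
        mtbf = platform_mtbf(node_mtbf_s, nodes)
        waste = first_order_waste(checkpoint_cost_s, mtbf)
        print(f"2^{exponent} nodes: platform MTBF {mtbf:.0f} s, "
              f"approximate waste {waste:.2f}")
```

Under these assumed numbers the platform MTBF shrinks in direct proportion to the node count, so the estimated waste grows with the square root of the number of nodes until the approximation saturates. That is consistent in spirit with the abstract's conclusion about non-prediction-based fault tolerance, although the paper's own models and scenarios may differ.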