Adaptive Fault Management of Parallel Applications for High-Performance Computing

Authors:
Zhiling Lan; Yawei Li
Affiliations:
Illinois Institute of Technology, Chicago; Illinois Institute of Technology, Chicago
Venue:
IEEE Transactions on Computers
Year:
2008

Citing 0
Cited 5

A study of dynamic meta-learning for failure prediction in large-scale systems
Journal of Parallel and Distributed Computing

Proactive process-level live migration and back migration in HPC environments
Journal of Parallel and Distributed Computing

Failure data-driven selective node-level duplication to improve MTTF in high performance computing systems
HPCS'09 Proceedings of the 23rd international conference on High Performance Computing Systems and Applications

Performance comparison under failures of MPI and MapReduce: An analytical approach
Future Generation Computer Systems

A job submission manager for large-scale distributed systems based on job futurity predictor
International Journal of Grid and Utility Computing

Abstract

As the scale of high-performance computing (HPC) grows, application fault resilience becomes increasingly important. In this paper, we propose FT-Pro, an adaptive fault management approach that combines the merits of reactive checkpointing and proactive migration. It enables parallel applications to avoid anticipated failures through preventive migration and, in the case of unforeseeable failures, to minimize their impact through selective checkpointing. An adaptation manager is designed to make runtime decisions in response to failure predictions. We evaluate FT-Pro through stochastic modeling and case studies with real applications under a wide range of settings. Preliminary results indicate that FT-Pro outperforms periodic checkpointing, both in reducing application completion times and in improving resource utilization, by up to 43%.
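
The abstract names three runtime options (migrate, checkpoint, or do neither) but does not state the decision rule the adaptation manager uses. The following Python sketch is purely illustrative and is not FT-Pro's algorithm: it assumes a simple expected-cost comparison at each decision point. All names (`Costs`, `expected_cost`, `choose_action`), the cost model, and the numeric values are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    MIGRATE = "migrate"        # proactive: move processes off a suspect node
    CHECKPOINT = "checkpoint"  # reactive: save state so less work is lost
    SKIP = "skip"              # do nothing at this decision point


@dataclass
class Costs:
    """Hypothetical cost parameters, all in seconds."""
    checkpoint: float  # time to write one checkpoint
    migrate: float     # time to migrate processes to a healthy node
    recovery: float    # time to restart from the last checkpoint
    interval: float    # computation time between two decision points


def expected_cost(action: Action, p_fail: float, unsaved: float, c: Costs) -> float:
    """Expected overhead of `action` over the next interval (assumed model).

    p_fail  -- assumed probability of a failure during the next interval
    unsaved -- seconds of work done since the last checkpoint
    """
    if action is Action.MIGRATE:
        # Simplifying assumption: migration fully avoids the failure.
        return c.migrate
    if action is Action.CHECKPOINT:
        # Past work is saved; only the upcoming interval is at risk.
        return c.checkpoint + p_fail * (c.recovery + c.interval)
    # SKIP: a failure loses the unsaved work plus the upcoming interval.
    return p_fail * (c.recovery + unsaved + c.interval)


def choose_action(failure_predicted: bool, precision: float,
                  base_rate: float, unsaved: float, c: Costs) -> Action:
    """Pick the action with the lowest expected cost.

    precision -- assumed probability that a predicted failure really occurs
    base_rate -- assumed failure probability when nothing is predicted
    """
    p_fail = precision if failure_predicted else base_rate
    candidates = [Action.SKIP, Action.CHECKPOINT]
    if failure_predicted:
        # Migration is only considered when a failure is anticipated.
        candidates.append(Action.MIGRATE)
    return min(candidates, key=lambda a: expected_cost(a, p_fail, unsaved, c))


if __name__ == "__main__":
    costs = Costs(checkpoint=60, migrate=120, recovery=300, interval=1800)
    print(choose_action(True, 0.7, 0.001, 3600, costs))
    print(choose_action(False, 0.7, 0.001, 0, costs))
    print(choose_action(False, 0.7, 0.02, 7200, costs))
```

Under these assumed costs, the sketch migrates when a failure is predicted with reasonable precision, skips when no failure is predicted and little work is at risk, and checkpoints when a large amount of unsaved work has accumulated. A real adaptation manager would also have to account for spare-node availability and prediction recall, which this sketch ignores.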