Adaptive Fault Management of Parallel Applications for High-Performance Computing

  • Authors:
  • Zhiling Lan;Yawei Li

  • Affiliations:
  • Illinois Institute of Technology, Chicago;Illinois Institute of Technology, Chicago

  • Venue:
  • IEEE Transactions on Computers
  • Year:
  • 2008

Quantified Score

Hi-index 14.98

Visualization

Abstract

As the scale of high performance computing (HPC) grows, application fault resilience becomes increasingly important. In this paper, we propose FT-Pro, an adaptive fault management approach that combines the merits of reactive checkpointing and proactive migration. It enables parallel applications to avoid anticipated failures via preventive migration, and in the case of unforeseeable failures, to minimize their impact through selective checkpointing. An adaptation manager is designed for making runtime decision in response to failure prediction. We evaluate FT-Pro through stochastic modeling and case studies with real applications under a wide range of settings. Preliminary results indicate that FT-Pro outperforms periodic checkpointing, in terms of both reducing application completion times and improving resource utilization, by up to 43%.