Fault tolerant algorithms for heat transfer problems

  • Authors:
  • Hatem Ltaief;Edgar Gabriel;Marc Garbey

  • Affiliations:
  • Department of Computer Science, University of Houston, 210 Philip G. Hoffman Hall, Houston, TX 77204-3010, USA;Department of Computer Science, University of Houston, 524 Philip G. Hoffman Hall, Houston, TX 77204-3010, USA;Department of Computer Science, University of Houston, 501 Philip G. Hoffman Hall, Houston, TX 77204-3010, USA

  • Venue:
  • Journal of Parallel and Distributed Computing
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

With the emergence of new massively parallel systems in the high performance computing area allowing scientific simulations to run on thousands of processors, the mean time between failures of large machines is decreasing from several weeks to a few minutes. The ability of hardware and software components to handle these singular events called process failures is therefore getting increasingly important. In order for a scientific code to continue despite a process failure, the application must be able to retrieve the lost data items. The recovery procedure after failures might be fairly straightforward for elliptic and linear hyperbolic problems. However, the reversibility in time for parabolic problems appears to be the most challenging part because it is an ill-posed problem. This paper focuses on new fault-tolerant numerical schemes for the time integration of parabolic problems. The new algorithm allows the application to recover from process failures and to reconstruct numerically the lost data of the failed process(es) avoiding the expensive roll-back operation required in most checkpoint/restart schemes. As a fault tolerant communication library, we use the fault tolerant message passing interface developed by the Innovative Computing Laboratory at the University of Tennessee. Experimental results show promising performances. Indeed, the three-dimensional parabolic benchmark code is able to recover and to keep on running after failures, adding only a very small penalty to the overall time of execution.