Fault tolerant algorithms for heat transfer problems

Authors:
Hatem Ltaief;Edgar Gabriel;Marc Garbey
Affiliations:
Department of Computer Science, University of Houston, 210 Philip G. Hoffman Hall, Houston, TX 77204-3010, USA;Department of Computer Science, University of Houston, 524 Philip G. Hoffman Hall, Houston, TX 77204-3010, USA;Department of Computer Science, University of Houston, 501 Philip G. Hoffman Hall, Houston, TX 77204-3010, USA
Venue:
Journal of Parallel and Distributed Computing
Year:
2008

Abstract

With the emergence of new massively parallel systems in high performance computing, which allow scientific simulations to run on thousands of processors, the mean time between failures of large machines is decreasing from several weeks to a few minutes. The ability of hardware and software components to handle these singular events, called process failures, is therefore becoming increasingly important. For a scientific code to continue despite a process failure, the application must be able to retrieve the lost data items. The recovery procedure after failures might be fairly straightforward for elliptic and linear hyperbolic problems. For parabolic problems, however, reversing the integration in time appears to be the most challenging part because it is an ill-posed problem. This paper focuses on new fault-tolerant numerical schemes for the time integration of parabolic problems. The new algorithm allows the application to recover from process failures and to reconstruct numerically the lost data of the failed process(es), avoiding the expensive roll-back operation required in most checkpoint/restart schemes. As a fault-tolerant communication library, we use the fault tolerant message passing interface developed by the Innovative Computing Laboratory at the University of Tennessee. Experimental results show promising performance: the three-dimensional parabolic benchmark code is able to recover and to keep running after failures, adding only a very small penalty to the overall execution time.
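
The ill-posedness of backward time integration mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: it assumes the one-dimensional heat equation u_t = u_xx on (0, pi) with homogeneous Dirichlet boundaries, for which the coefficient of sin(kx) is multiplied by exp(-k^2 t) after time t. The function name `evolve`, the initial coefficients, and the noise level are all hypothetical choices made for illustration.

```python
import math


def evolve(coeffs, t):
    """Advance sine-series coefficients of u_t = u_xx on (0, pi) by time t.

    coeffs[k-1] is the coefficient of sin(k x). A negative t integrates
    backward in time, which multiplies mode k by exp(+k^2 |t|).
    """
    return [c * math.exp(-(k * k) * t) for k, c in enumerate(coeffs, start=1)]


# Hypothetical initial state: coefficients of sin(x), sin(2x), sin(3x), sin(4x).
u0 = [1.0, 0.5, 0.25, 0.125]
T = 1.0

# Forward integration is well-posed: every mode is damped.
uT = evolve(u0, T)

# Perturb the final state slightly (e.g. round-off or reconstruction error).
noise = 1e-10
uT_noisy = [c + noise for c in uT]

# Naive backward integration amplifies the perturbation in mode k by exp(k^2 T).
recovered = evolve(uT_noisy, -T)
errors = [abs(r - c) for r, c in zip(recovered, u0)]

for k, e in enumerate(errors, start=1):
    print(f"mode {k}: recovery error {e:.3e}")
```

In this sketch the same perturbation produces an error that grows by several orders of magnitude from the first mode to the fourth, which is why a lost subdomain cannot simply be recovered by stepping the parabolic solver backward from a later state.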