Enhancing fault-tolerance of large-scale MPI scientific applications

Authors:
G. Rodríguez; P. González
Affiliations:
Computer Architecture Group, Dep. Electronics and Systems, University of A Coruña, Spain; Computer Architecture Group, Dep. Electronics and Systems, University of A Coruña, Spain
Venue:
PaCT'07 Proceedings of the 9th international conference on Parallel Computing Technologies
Year:
2007

Abstract

Large-scale computational science and engineering parallel applications, executed on clusters or Grid platforms, usually run for longer than the mean time between failures (MTBF). Hardware failures must therefore be tolerated so that a machine failure does not discard all the computation completed so far. Checkpointing and rollback recovery are useful techniques for implementing fault-tolerant applications. Although extensive research has been carried out in this field, few tools are available to help parallel programmers add fault tolerance support to their applications. This work reports the experience of adding fault tolerance to two large MPI scientific applications: an air quality simulation model and a crack growth analysis. The fault-tolerant solution was implemented with CPPC, a checkpointing and recovery framework. Detailed experimental results show the practical usefulness and low overhead of this checkpointing approach.
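
To make the idea of checkpointing and rollback recovery concrete, the following is a minimal single-process sketch in Python. It is not the CPPC API and does not involve MPI; the function names (`save_checkpoint`, `load_checkpoint`, `run`), the toy computation, and the `fail_at` failure hook are all assumptions made for illustration. The sketch periodically saves the loop state to disk and, on restart, resumes from the last saved state instead of from the beginning.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # Rename within one directory, so a crash mid-write leaves the
    # previous checkpoint intact rather than a half-written file.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Toy iterative computation that checkpoints every `interval` steps.

    `fail_at` is a hypothetical hook that simulates a machine failure
    just before the given step executes.
    """
    # Rollback recovery: resume from the last checkpoint if one exists.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"] ** 2  # stand-in for one solver iteration
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` again with the same checkpoint path after a failure repeats only the steps executed since the last checkpoint. A real MPI application additionally has to keep the checkpoints of all processes mutually consistent, which this single-process sketch does not address.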