Large-scale parallel applications in computational science and engineering, executed on clusters or Grid platforms, usually run for longer than the mean time between failures (MTBF). Hardware failures must therefore be tolerated so that a machine failure does not discard all of the computation completed so far. Checkpointing and rollback recovery are useful techniques for implementing fault-tolerant applications. Although extensive research has been carried out in this field, few tools are available to help parallel programmers add fault tolerance support to their applications. This work reports on the experience of adding fault tolerance to two large MPI scientific applications: an air quality simulation model and a crack growth analysis. The fault-tolerant solution was implemented with the CPPC framework, a checkpointing and recovery tool. Detailed experimental results show the practical usefulness and low overhead of this checkpointing approach.