MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Fault Tolerance in Message Passing Interface Programs
International Journal of High Performance Computing Applications
Providing Non-stop Service for Message-Passing Based Parallel Applications with RADIC
Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Parallel machines are growing in complexity and in number of components, which increases the probability of faults. As a result, MPI applications running on these machines may not reach completion. This paper presents RADIC/OMPI, the integration of the RADIC fault tolerance architecture into Open MPI. RADIC/OMPI relies on uncoordinated checkpoints combined with pessimistic receiver-based message logs, managed in a distributed way without the need for any central or stable elements. It thereby ensures that the application completes, automatically and transparently for users and administrators. We conclude that, for certain applications, RADIC/OMPI provides fault tolerance with an acceptable overhead even in the presence of consecutive faults.
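To illustrate how uncoordinated checkpoints and pessimistic receiver-based message logging fit together, here is a minimal single-machine sketch. It is not the RADIC/OMPI implementation: the `Protector` and `Process` classes, the `recover` function, and the toy integer state are all hypothetical names chosen for this example, and the assumption that the log lives on a different node is modeled only by keeping it in a separate object.

```python
from dataclasses import dataclass, field


@dataclass
class Protector:
    """Hypothetical stand-in for storage held on a different node."""
    checkpoint: int = 0                      # last saved application state
    log: list = field(default_factory=list)  # messages received since then

    def save_checkpoint(self, state):
        # A new checkpoint makes the older log entries unnecessary.
        self.checkpoint = state
        self.log.clear()


@dataclass
class Process:
    protector: Protector
    state: int = 0  # toy application state: running sum of payloads

    def receive(self, payload):
        # Pessimistic: log the message *before* delivering it.
        self.protector.log.append(payload)
        self.state += payload

    def checkpoint(self):
        # Uncoordinated: each process checkpoints on its own schedule.
        self.protector.save_checkpoint(self.state)


def recover(protector):
    """Rebuild a failed process from its checkpoint, then replay its log."""
    restarted = Process(protector=protector, state=protector.checkpoint)
    for payload in list(protector.log):
        restarted.state += payload  # replay without logging again
    return restarted


protector = Protector()
p = Process(protector)
p.receive(5)
p.checkpoint()      # state 5 saved, log cleared
p.receive(7)
p.receive(1)
state_before_failure = p.state

# Simulate a failure of p: only the protector's data survives.
recovered = recover(protector)
```

Because every message is logged before it is delivered, the recovered process reaches the same state as the failed one without requiring any other process to roll back, which is what makes independent (uncoordinated) checkpoints sufficient in this sketch.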