A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World
Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
MPICH-V: toward a scalable fault tolerant MPI for volatile nodes
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
CCGRID '01 Proceedings of the 1st International Symposium on Cluster Computing and the Grid
Egida: An Extensible Toolkit For Low-Overhead Fault-Tolerance
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations
HPDC '99 Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing
Providing Non-stop Service for Message-Passing Based Parallel Applications with RADIC
Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes
Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Communication target selection for replicated MPI processes
EuroMPI'10 Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface
A Robust and Efficient Message Passing Library for Volunteer Computing Environments
Journal of Grid Computing
Fault tolerance in an industrial seismic processing application for multicore clusters
EuroMPI'11 Proceedings of the 18th European MPI Users' Group conference on Recent advances in the message passing interface
Hi-index | 0.00 |
Independence of special elements, transparency and scalability are very significant features required from the fault tolerance schemes for modern clusters of computers. In order to attend such requirements we developed the RADIC architecture (Redundant Array of Distributed Independent Checkpoints). RADIC is an architecture based on a fully distributed array of processes that collaborate in order to create a distributed fault tolerance controller. This controller works without special, central or stable elements. RADIC implements the fault tolerance activities, transparently to the user application, using a message-log rollback-recovery protocol. Using the RADIC concepts we implemented a prototype, RADICMPI, which contains some standard MPI directives and includes all functionalities of RADIC. We tested RADICMPI in a real environment by injecting failures in nodes of the cluster and monitoring the behavior of the application. Our tests confirmed the correct operation of RADICMPI and the effectiveness of the RADIC mechanism.