Current supercomputers are approaching the petaflop level. These machines experience a high number of interruptions within relatively short time intervals. Fault tolerance and preventive maintenance are therefore key to extending the MTTI (Mean Time To Interrupt). In this paper we present how RADIC, an architecture for fault tolerance, provides different protection levels that avoid system interruptions and allow preventive maintenance tasks to be performed. Our experiments show that our solution is effective at maintaining high availability with a large MTTI.