An intelligent management of fault tolerance in cluster using RADICMPI

  • Authors:
  • Angelo Duarte;Dolores Rexachs;Emilio Luque

  • Affiliations:
  • Computer Architecture and Operating Systems Department, University Autonoma of Barcelona, Barcelona, Spain;Computer Architecture and Operating Systems Department, University Autonoma of Barcelona, Barcelona, Spain;Computer Architecture and Operating Systems Department, University Autonoma of Barcelona, Barcelona, Spain

  • Venue:
  • EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
  • Year:
  • 2006

Abstract

Independence from special elements, transparency, and scalability are key features required of fault tolerance schemes for modern computer clusters. To meet these requirements we developed the RADIC architecture (Redundant Array of Distributed Independent Checkpoints). RADIC is based on a fully distributed array of processes that collaborate to form a distributed fault tolerance controller. This controller works without special, central, or stable elements. RADIC carries out its fault tolerance activities transparently to the user application, using a message-log rollback-recovery protocol. Using the RADIC concepts we implemented a prototype, RADICMPI, which provides some standard MPI directives and includes all the functionality of RADIC. We tested RADICMPI in a real environment by injecting failures into nodes of the cluster and monitoring the behavior of the application. Our tests confirmed the correct operation of RADICMPI and the effectiveness of the RADIC mechanism.
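
To make the idea of message-log rollback-recovery concrete, the following is a minimal single-machine sketch in Python. It is not RADICMPI code and does not use MPI: the names `RemoteStore`, `Worker`, and `demo` are hypothetical, the "remote" storage is only an in-memory object standing in for state held on another node, and the worker is assumed to be deterministic. It illustrates the general technique only: log each message before delivering it, take periodic checkpoints, and after a failure restore the last checkpoint and replay the logged messages.

```python
import copy


class RemoteStore:
    """Hypothetical stand-in for checkpoint and log storage held on another node."""

    def __init__(self):
        self.checkpoint = None  # last saved copy of the worker state
        self.log = []           # messages received since that checkpoint

    def save_checkpoint(self, state):
        self.checkpoint = copy.deepcopy(state)
        # Messages already reflected in the checkpoint no longer need replaying.
        self.log.clear()

    def append(self, message):
        self.log.append(message)


class Worker:
    """Deterministic process (an assumption): it sums the integers it receives."""

    def __init__(self, store):
        self.store = store
        self.state = {"total": 0, "processed": 0}

    def _apply(self, message):
        self.state["total"] += message
        self.state["processed"] += 1

    def receive(self, message):
        self.store.append(message)  # log first, then deliver
        self._apply(message)

    def checkpoint(self):
        self.store.save_checkpoint(self.state)

    @classmethod
    def recover(cls, store):
        """Rebuild a failed worker: roll back to the checkpoint, then replay the log."""
        worker = cls(store)
        if store.checkpoint is not None:
            worker.state = copy.deepcopy(store.checkpoint)
        for message in store.log:
            worker._apply(message)
        return worker


def demo():
    store = RemoteStore()
    worker = Worker(store)
    for message in (1, 2, 3):
        worker.receive(message)
    worker.checkpoint()
    for message in (4, 5):
        worker.receive(message)
    state_before_failure = dict(worker.state)
    del worker  # simulate the loss of the node running the worker
    recovered = Worker.recover(store)
    return state_before_failure, recovered.state
```

Calling `demo()` returns the state just before the simulated failure and the state rebuilt from the checkpoint plus the log; for a deterministic worker the two are equal. A real system would additionally have to place the store on a different node and handle messages in flight at the moment of failure, which this sketch leaves out.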