Challenges and Issues of the Integration of RADIC into Open MPI

  • Authors:
  • Leonardo Fialho; Guna Santos; Angelo Duarte; Dolores Rexachs; Emilio Luque

  • Affiliations:
  • Computer Architecture and Operating Systems Department, University Autonoma of Barcelona, Bellaterra, Barcelona, Spain 08193; Computer Architecture and Operating Systems Department, University Autonoma of Barcelona, Bellaterra, Barcelona, Spain 08193; Departamento de Tecnologia, Universidade Estadual de Feira de Santana, Feira de Santana, Bahia, Brazil; Computer Architecture and Operating Systems Department, University Autonoma of Barcelona, Bellaterra, Barcelona, Spain 08193; Computer Architecture and Operating Systems Department, University Autonoma of Barcelona, Bellaterra, Barcelona, Spain 08193

  • Venue:
  • Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
  • Year:
  • 2009

Abstract

Parallel machines are growing in complexity and in number of components, which increases the probability of faults. As a result, MPI applications running on these machines may fail to complete. This paper presents RADIC/OMPI, the integration of the RADIC fault tolerance architecture into Open MPI. RADIC/OMPI combines uncoordinated checkpoints with pessimistic receiver-based message logs in a distributed way, without relying on any central or stable element. It therefore ensures application completion automatically and transparently for users and administrators. We conclude that, for certain applications, RADIC/OMPI provides fault tolerance with acceptable overhead, even in the presence of consecutive faults.
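
As a rough illustration of the technique named in the abstract, the following toy sketch shows how an uncoordinated checkpoint combined with a pessimistic receiver-based message log lets a single process be rebuilt after a fault. It is not RADIC/OMPI code and does not use any Open MPI API: the classes `NeighbourStore` and `Process`, the integer messages, and the summing "application" are all hypothetical, and the store that would live on another node is simulated in memory.

```python
import copy


class NeighbourStore:
    """Hypothetical stand-in for storage kept on a different node."""

    def __init__(self):
        self.checkpoint = None  # last saved process state, if any
        self.log = []           # messages received since that checkpoint


class Process:
    """Toy process whose state depends only on the order of received messages."""

    def __init__(self, store):
        self.state = {"total": 0, "delivered": 0}
        self.store = store

    def _apply(self, msg):
        # Deterministic application step (assumption: messages are integers).
        self.state["total"] += msg
        self.state["delivered"] += 1

    def receive(self, msg):
        # Pessimistic, receiver-based: the receiver logs the message
        # before the application is allowed to consume it.
        self.store.log.append(msg)
        self._apply(msg)

    def take_checkpoint(self):
        # Uncoordinated: the process saves its own state without
        # synchronising with any other process.
        self.store.checkpoint = copy.deepcopy(self.state)
        # Messages already reflected in the checkpoint need not be replayed.
        self.store.log.clear()

    @classmethod
    def recover(cls, store):
        # Restart from the last checkpoint, then replay the logged messages.
        proc = cls(store)
        if store.checkpoint is not None:
            proc.state = copy.deepcopy(store.checkpoint)
        for msg in store.log:
            proc._apply(msg)
        return proc


def demo():
    store = NeighbourStore()
    proc = Process(store)
    proc.receive(3)
    proc.receive(4)
    proc.take_checkpoint()
    proc.receive(5)                 # logged, but not yet checkpointed
    before = dict(proc.state)
    del proc                        # simulate the fault
    recovered = Process.recover(store)
    return before, recovered.state


if __name__ == "__main__":
    print(demo())
```

In this sketch the recovered state matches the state before the simulated fault because every message was logged before delivery, so only the failed process has to roll back. A real implementation would also have to handle the store itself failing, which this sketch ignores.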