Fault management in P2P-MPI

  • Authors:
  • Stéphane Genaud;Choopan Rattanapoka

  • Affiliations:
  • ICPS, LSIIT, UMR, Université Louis Pasteur, Strasbourg;ICPS, LSIIT, UMR, Université Louis Pasteur, Strasbourg

  • Venue:
  • GPC'07 Proceedings of the 2nd international conference on Advances in grid and pervasive computing
  • Year:
  • 2007

Abstract

This paper presents recent developments in fault management for P2P-MPI, a grid middleware, covering both fault tolerance for applications and fault detection. P2P-MPI provides a transparent fault-tolerance facility based on replication of computations. Applications are monitored by a distributed set of external modules called failure detectors. The contribution of this paper is an analysis of the advantages and drawbacks of such detectors for a real implementation, and their integration into P2P-MPI. We pay particular attention to the reliability of the failure detection service and to the failure detection speed. We propose a variant of the binary round-robin protocol, which is in all cases more reliable than the application execution itself. Experiments on applications of up to 256 processes, carried out on Grid'5000, show that the measured detection times closely match the predictions.
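
The abstract does not spell out the proposed variant, so the following is only a minimal sketch of the baseline binary round-robin gossip schedule as it is commonly described, not the authors' implementation. It assumes n detectors numbered 0 to n-1 and synchronous rounds in which, at round r, detector s sends its heartbeat table to detector (s + 2^(r-1)) mod n. The function names and the set-based model of "who has heard from whom" are hypothetical, chosen for illustration.

```python
import math


def brr_destination(s: int, r: int, n: int) -> int:
    """Destination of detector s in round r (1-based) of the assumed
    baseline binary round-robin schedule."""
    return (s + 2 ** (r - 1)) % n


def rounds_to_full_dissemination(n: int) -> int:
    """Count the synchronous rounds needed until every detector has
    received, directly or transitively, information from all n detectors."""
    # known[i] is the set of detectors whose heartbeat has reached detector i.
    known = [{i} for i in range(n)]
    rounds = 0
    while any(len(k) < n for k in known):
        rounds += 1
        # The schedule cycles through rounds 1..ceil(log2(n)).
        cycle = max(1, math.ceil(math.log2(n)))
        r = (rounds - 1) % cycle + 1
        # All messages of a round are sent from the state at round start.
        snapshot = [set(k) for k in known]
        for s in range(n):
            known[brr_destination(s, r, n)] |= snapshot[s]
    return rounds


if __name__ == "__main__":
    for n in (8, 256):
        print(n, rounds_to_full_dissemination(n))
```

Under these assumptions the information from every detector reaches all others in ceil(log2(n)) rounds, which is why the detection time of such a schedule can be predicted from the number of processes and the gossip period.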