MPICH-V: toward a scalable fault tolerant MPI for volatile nodes

  • Authors:
  • George Bosilca; Aurelien Bouteiller; Franck Cappello; Samir Djilali; Gilles Fedak; Cecile Germain; Thomas Herault; Pierre Lemarinier; Oleg Lodygensky; Frederic Magniette; Vincent Neri; Anton Selikhov

  • Affiliations:
  • LRI, Université de Paris Sud, Orsay, France (all authors)

  • Venue:
  • Proceedings of the 2002 ACM/IEEE conference on Supercomputing
  • Year:
  • 2002

Abstract

Global Computing platforms, large-scale clusters, and future TeraGRID systems gather thousands of nodes to run parallel scientific applications. At this scale, node failures and disconnections are frequent events. This volatility reduces the MTBF of the whole system to the range of hours or minutes. We present MPICH-V, an automatic volatility-tolerant MPI environment based on uncoordinated checkpoint/rollback and distributed message logging. The MPICH-V architecture relies on Channel Memories, Checkpoint Servers, and theoretically proven protocols to execute existing or new SPMD and Master-Worker MPI applications on volatile nodes. To evaluate its capabilities, we run MPICH-V within a framework in which the number of nodes, Channel Memories, and Checkpoint Servers, as well as the node volatility, can be completely configured. We present a detailed performance evaluation of every component of MPICH-V and of its global performance for non-trivial parallel applications. Experimental results demonstrate good scalability and high tolerance to node volatility.
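
The abstract's claim that volatility pushes the whole-system MTBF down to hours or minutes can be illustrated with a short calculation. The sketch below is not from the paper. It assumes that nodes fail independently with exponentially distributed lifetimes and that the system counts as failed as soon as any single node fails, in which case failure rates add and the system MTBF is the per-node MTBF divided by the node count. The function name `system_mtbf_hours` and the 1000-hour per-node figure are hypothetical, chosen only for illustration.

```python
def system_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """MTBF of a system that fails whenever any one of its nodes fails.

    Assumption (not from the paper): nodes are independent and have
    exponentially distributed lifetimes, so their failure rates add:
    rate_system = num_nodes / node_mtbf_hours.
    """
    if node_mtbf_hours <= 0 or num_nodes <= 0:
        raise ValueError("node_mtbf_hours and num_nodes must be positive")
    return node_mtbf_hours / num_nodes


if __name__ == "__main__":
    # Illustrative per-node MTBF only; not a measurement from the paper.
    node_mtbf = 1000.0
    for n in (1, 100, 1000, 10000):
        print(f"{n:>6} nodes -> system MTBF {system_mtbf_hours(node_mtbf, n):g} h")
```

Under these assumptions, a thousand nodes that each fail about once every thousand hours yield a failure somewhere in the system roughly once per hour, which is the regime that motivates automatic checkpoint/rollback rather than restarting the whole application.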