Extended mpiJava for distributed checkpointing and recovery

  • Authors:
  • Emilio Hernández;Yudith Cardinale;Wilmer Pereira

  • Affiliations:
  • Departamento de Computación y Tecnología de la Información, Universidad Simón Bolívar, Caracas, Venezuela;Departamento de Computación y Tecnología de la Información, Universidad Simón Bolívar, Caracas, Venezuela;Departamento de Computación y Tecnología de la Información, Universidad Simón Bolívar, Caracas, Venezuela

  • Venue:
  • EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
  • Year:
  • 2006

Abstract

In this paper we describe an mpiJava extension that implements a parallel checkpointing/recovery service. The facility is transparent to applications, i.e. no instrumentation is needed. We take checkpoints with a distributed approach, meaning that each process takes its local checkpoint independently. This approach reduces communication between processes, and there is no need for a central server for checkpoint storage. We present experiments suggesting that this extended MPI functionality does not incur a significant performance penalty, apart from the well-known cost of generating the local checkpoints.
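
To make the idea of independent local checkpoints concrete, the following is a minimal sketch and not the mechanism described in the paper. It assumes application-level Java serialization, whereas the paper's service is transparent and needs no instrumentation. The class `LocalCheckpoint`, its methods, and the `rank-N.ckpt` file naming are hypothetical, and plain loop indices stand in for MPI ranks so that the example runs without mpiJava.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: each process saves and restores only its own state,
// so no central checkpoint server and no inter-process messages are involved.
final class LocalCheckpoint {

    // One checkpoint file per process, keyed by its rank (assumed naming scheme).
    static Path fileFor(Path dir, int rank) {
        return dir.resolve("rank-" + rank + ".ckpt");
    }

    // Serialize the state to a temporary file first, then move it into place,
    // so a failure while writing does not clobber the previous checkpoint.
    static void save(Path dir, int rank, Serializable state) throws IOException {
        Path tmp = Files.createTempFile(dir, "rank-" + rank + "-", ".tmp");
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(tmp))) {
            out.writeObject(state);
        }
        Files.move(tmp, fileFor(dir, rank), StandardCopyOption.REPLACE_EXISTING);
    }

    // Read back the most recent checkpoint written by the given rank.
    static Object restore(Path dir, int rank) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(fileFor(dir, rank)))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("ckpt");
        // Simulate three processes, each checkpointing independently of the others.
        for (int rank = 0; rank < 3; rank++) {
            long[] state = {rank, rank * 10L};
            save(dir, rank, state);
        }
        // Recover the state of the process with rank 1 from its own file.
        long[] recovered = (long[]) restore(dir, 1);
        System.out.println(recovered[0] + " " + recovered[1]);
    }
}
```

The sketch covers only local storage and retrieval. Restoring a consistent global state from independently taken checkpoints generally also requires handling messages that were in transit when the checkpoints were taken, which is outside the scope of this example.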