A checkpoint/recovery model for heterogeneous dataflow computations using work-stealing

  • Authors:
  • Samir Jafar;Thierry Gautier;Axel Krings;Jean-Louis Roch

  • Affiliations:
  • Laboratoire ID – IMAG, Pre-project MOAIS (CNRS-INRIA, INPG-UJF), Montbonnot St. Martin, France;Laboratoire ID – IMAG, Pre-project MOAIS (CNRS-INRIA, INPG-UJF), Montbonnot St. Martin, France;Computer Science Dept, University of Idaho, Moscow, ID;Laboratoire ID – IMAG, Pre-project MOAIS (CNRS-INRIA, INPG-UJF), Montbonnot St. Martin, France

  • Venue:
  • Euro-Par'05: Proceedings of the 11th International Euro-Par Conference on Parallel Processing
  • Year:
  • 2005

Abstract

This paper presents a new checkpoint/recovery method for dataflow computations that use work-stealing in heterogeneous environments, such as those found in grid or cluster computing. The state of the computation is represented by a dynamic macro dataflow graph, and the resulting mechanisms are shown to provide effective checkpointing for multithreaded applications in heterogeneous environments. Two methods are presented, Systematic Event Logging and Theft-Induced Checkpointing. Both are efficient and extremely flexible under the system-state model, allowing recovery on a different platform and with a different number of processors. A formal analysis of the overhead induced by both methods is given, followed by an experimental evaluation on a large cluster. The results show that both methods incur very small overhead and that the trade-off between checkpointing cost and recovery cost can be controlled.