A Probabilistic Fault-Tolerant Recovery Mechanism for Task and Result Certification of Large-Scale Distributed Applications

  • Authors:
  • Rim Chayeh;Christophe Cerin;Mohamed Jemni

  • Affiliations:
  • Ecole Supérieure des Sciences et Techniques de Tunis, Unité de recherche UTIC, Tunis, Tunisia;LIPN-UMR CNRS 7030, Institut Galilée, Université Paris-Nord, Villetaneuse, France 93430;Ecole Supérieure des Sciences et Techniques de Tunis, Unité de recherche UTIC, Tunis, Tunisia

  • Venue:
  • GPC '09 Proceedings of the 4th International Conference on Advances in Grid and Pervasive Computing
  • Year:
  • 2009

Abstract

This paper addresses fault-tolerant recovery mechanisms and probabilistic result certification on large-scale architectures. Related work on result certification relies on total or partial duplication of the application, but is limited to the execution of independent tasks. In the present work, we extend these mechanisms to applications with dependent tasks. First, we propose an approach based on a macro-dataflow graph, an abstract representation of a parallel execution. Second, we introduce probabilistic certification algorithms that avoid re-executing the program and allow recovery on different platforms with different numbers of processors. We also sketch how to simulate our framework using state-of-the-art workload modeling and fault injection tools.
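To make the idea of certification by partial duplication on a dependency graph concrete, the following is a minimal Python sketch. It is not the paper's algorithm: the graph, the task functions, and the names `run_graph`, `certify`, and `miss_probability` are all hypothetical, and the sampling scheme is a generic Monte Carlo check, assumed here for illustration only. Each task is recorded together with its inputs, so a sampled task can be re-executed in isolation on a trusted machine without re-running the whole program.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical macro-dataflow graph: task -> list of predecessor tasks.
graph = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}

# Hypothetical task bodies; arguments follow the predecessor order above.
ops = {
    "a": lambda: 2,
    "b": lambda: 3,
    "c": lambda x, y: x + y,
    "d": lambda x: x * 10,
}

def run_graph(graph, ops, corrupt=frozenset()):
    """Execute tasks in dependency order; return {task: (inputs, output)}."""
    trace = {}
    for task in TopologicalSorter(graph).static_order():
        inputs = tuple(trace[p][1] for p in graph[task])
        output = ops[task](*inputs)
        if task in corrupt:
            output += 1  # simulated fault: the worker forges this result
        trace[task] = (inputs, output)
    return trace

def certify(trace, ops, samples, rng):
    """Re-execute `samples` randomly drawn tasks on their recorded inputs.

    Returns False as soon as one re-execution disagrees with the trace.
    """
    tasks = sorted(trace)
    for task in rng.choices(tasks, k=samples):  # sampling with replacement
        inputs, output = trace[task]
        if ops[task](*inputs) != output:
            return False
    return True

def miss_probability(forged_fraction, samples):
    """Chance that `samples` independent draws all miss a forged task."""
    return (1.0 - forged_fraction) ** samples

clean_trace = run_graph(graph, ops)
forged_trace = run_graph(graph, ops, corrupt=set(graph))
```

Under these assumptions, a check of a sampled task only reveals a fault in that task itself. A task that correctly processed an already forged input passes, which is one reason dependent tasks are harder to certify than independent ones. If a fraction q of the tasks is forged, k independent samples all miss them with probability (1 - q)^k, so the number of re-executed tasks can stay well below the size of the graph.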