A Probabilistic Fault-Tolerant Recovery Mechanism for Task and Result Certification of Large-Scale Distributed Applications

Authors:
Rim Chayeh;Christophe Cerin;Mohamed Jemni
Affiliations:
Ecole Supérieure des Sciences et Techniques de Tunis, Unité de recherche UTIC, Tunis, Tunisia;LIPN-UMR CNRS 7030, Institut Galilée, Université Paris-Nord, Villetaneuse, France 93430;Ecole Supérieure des Sciences et Techniques de Tunis, Unité de recherche UTIC, Tunis, Tunisia
Venue:
GPC '09 Proceedings of the 4th International Conference on Advances in Grid and Pervasive Computing
Year:
2009

Abstract

This paper addresses fault-tolerant recovery mechanisms and probabilistic result certification on large-scale architectures. Related work on result certification relies on total or partial duplication of the application, but it is limited to the execution of independent tasks. In the present work, we extend these mechanisms to applications with dependent tasks. First, we propose an approach based on an abstract representation of a parallel execution called a macro-dataflow graph. Second, we introduce probabilistic certification algorithms that avoid re-executing the program and allow recovery on different platforms with a different number of processors. We also sketch how to simulate our framework using state-of-the-art tools for workload modeling and fault injection.
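
The general idea of probabilistic certification over a task graph can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes a generic Monte Carlo scheme: a verifier re-executes a few randomly chosen tasks on their reported inputs and compares outputs. If a fraction q of the tasks were falsified, n independent uniform draws miss all of them with probability (1 - q)^n. All names below (`checks_needed`, `certify`, `graph`, `recompute`) are hypothetical and chosen for illustration only.

```python
import math
import random


def checks_needed(q, epsilon):
    """Smallest n such that (1 - q)**n <= epsilon.

    q       : assumed fraction of falsified tasks (0 < q < 1)
    epsilon : tolerated probability of missing every falsified task
    """
    if not (0 < q < 1) or not (0 < epsilon < 1):
        raise ValueError("q and epsilon must lie strictly between 0 and 1")
    return math.ceil(math.log(epsilon) / math.log(1 - q))


def certify(graph, results, recompute, n, rng):
    """Re-execute n randomly drawn tasks and compare with reported results.

    graph     : dict mapping each task to the list of its predecessor tasks
                (a toy stand-in for a macro-dataflow graph)
    results   : dict mapping each task to its reported output
    recompute : trusted function (task, inputs) -> output
    rng       : random.Random instance, passed in for reproducibility

    Inputs are taken from the reported results, so a check only detects a
    task whose own output is inconsistent with its reported inputs.
    """
    tasks = list(graph)
    for _ in range(n):
        task = rng.choice(tasks)
        inputs = [results[p] for p in graph[task]]
        if recompute(task, inputs) != results[task]:
            return False  # mismatch found: reject the execution
    return True  # no mismatch among the sampled tasks
```

For example, under the assumption that at least 10% of the tasks are falsified, `checks_needed(0.1, 0.01)` returns 44, so 44 sampled re-executions bound the miss probability by 1%, regardless of the total number of tasks.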