This paper addresses fault-tolerant recovery mechanisms and probabilistic result certification on large-scale architectures. Related work on result certification relies on total or partial duplication of the application, but is limited to the execution of independent tasks. In the present work, we extend these mechanisms to applications with dependent tasks. First, we propose an approach based on an abstract representation of a parallel execution called a macro-dataflow graph. Second, we introduce probabilistic certification algorithms that avoid re-executing the whole program and allow recovery on different platforms with different numbers of processors. We also sketch how to simulate our framework using state-of-the-art workload modeling and fault injection tools.
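To make the idea of certification by partial duplication concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a simple Monte Carlo test: a trusted verifier picks tasks of the dataflow graph uniformly at random, re-executes each one on its recorded inputs, and compares the result with the recorded output. If at least a fraction q of the tasks would fail this comparison, n independent samples miss all of them with probability at most (1 - q)^n, so n = ceil(log(eps) / log(1 - q)) samples bound the error by eps. The names `samples_needed`, `certify`, `PREDS`, `FUNCS`, and `HONEST` are hypothetical and introduced only for illustration.

```python
import math
import random


def samples_needed(q, eps):
    """Smallest n such that (1 - q)**n <= eps (assumed Monte Carlo bound)."""
    if not 0.0 < q <= 1.0 or not 0.0 < eps < 1.0:
        raise ValueError("need 0 < q <= 1 and 0 < eps < 1")
    if q == 1.0:
        return 1
    return math.ceil(math.log(eps) / math.log(1.0 - q))


def certify(preds, funcs, recorded, q, eps, rng=None):
    """Re-execute randomly sampled tasks and compare with recorded outputs.

    preds:    task name -> list of predecessor task names (the graph edges)
    funcs:    task name -> function taking the list of input values
    recorded: task name -> output reported by the untrusted execution
    Returns False as soon as one sampled task disagrees, True otherwise.
    """
    rng = rng or random.Random()
    tasks = sorted(preds)
    for _ in range(samples_needed(q, eps)):
        task = rng.choice(tasks)  # uniform sampling with replacement
        # Inputs are taken from the recorded trace, so only the sampled
        # task is re-executed, never the whole program.
        inputs = [recorded[p] for p in preds[task]]
        if funcs[task](inputs) != recorded[task]:
            return False
    return True


# Hypothetical three-task graph: c depends on a and b.
PREDS = {"a": [], "b": [], "c": ["a", "b"]}
FUNCS = {"a": lambda _: 2, "b": lambda _: 3, "c": lambda xs: sum(xs)}
HONEST = {"a": 2, "b": 3, "c": 5}

if __name__ == "__main__":
    print(samples_needed(0.5, 0.01))
    print(certify(PREDS, FUNCS, HONEST, q=0.5, eps=0.01))
```

Note that this sketch only detects tasks whose recorded output is inconsistent with their recorded inputs. A task that faithfully propagates a forged input passes the check, which is one reason dependent tasks need more care than independent ones.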