Introduction to Reinforcement Learning
Introduction to Reinforcement Learning
Computers and Intractability: A Guide to the Theory of NP-Completeness
Computers and Intractability: A Guide to the Theory of NP-Completeness
Fault Tolerance via Replication in Coarse Grain Data-Flow
PSLS '95 Proceedings of the International Workshop on Parallel Symbolic Languages and Systems
Methodologies for advance warning of compute cluster problems via statistical analysis: a case study
Proceedings of the 2009 workshop on Resiliency in high performance
A model for predicting the optimum checkpoint interval for restart dumps
ICCS'03 Proceedings of the 2003 international conference on Computational science
Hi-index | 0.00 |
Traditional parallel programming techniques will suffer rapid deterioration of performance scaling with growing platform size, as the work of coping with increasingly frequent failures dominates over useful computation. To address this challenge, we introduce and simulate a novel software architecture that combines a task dependency graph with a substitution graph. The role of the dependency graph is to limit communication and checkpointing and enhance fault tolerance by allowing graph neighbors to exchange data, while the substitution graph promotes fault oblivious computing by allowing a failed task to be substituted on-the-fly by another task, incurring a quantifiable error. We present optimization formulations for trading off substitution errors and other factors such as available system capacity and low-overlap task partitioning among processors, and demonstrate that these can be approximately solved in real time after some simplifications. Simulation studies of our proposed approach indicate that a substitution network adds considerable resilience and simple enhancements can limit the aggregate substitution errors.