Fault oblivious high performance computing with dynamic task replication and substitution

Authors:
Yevgeniy Vorobeychik;Jackson R. Mayo;Robert C. Armstrong;Ronald G. Minnich;Don W. Rudish
Affiliations:
Sandia National Laboratories, Livermore, USA 94551-0969 (all authors)
Venue:
Computer Science - Research and Development
Year:
2011

Abstract

Traditional parallel programming techniques will suffer rapid deterioration of performance scaling with growing platform size, as the work of coping with increasingly frequent failures comes to dominate useful computation. To address this challenge, we introduce and simulate a novel software architecture that combines a task dependency graph with a substitution graph. The dependency graph limits communication and checkpointing and enhances fault tolerance by allowing graph neighbors to exchange data, while the substitution graph promotes fault-oblivious computing by allowing a failed task to be substituted on the fly by another task, incurring a quantifiable error. We present optimization formulations for trading off substitution errors against other factors, such as available system capacity and low-overlap task partitioning among processors, and demonstrate that these can be approximately solved in real time after some simplifications. Simulation studies of our proposed approach indicate that a substitution network adds considerable resilience and that simple enhancements can limit the aggregate substitution errors.
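The substitution idea in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture or their optimization formulation. It assumes a greedy rule (pick the live neighbor with the smallest substitution error) and a simple additive aggregate error. Every name here (`SubstitutionGraph`, `substitute_for`, `resolve_failures`, the task labels, and the error values) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SubstitutionGraph:
    """Hypothetical structure: edges[task][other] is the error incurred
    when `other` stands in for a failed `task`."""
    edges: dict = field(default_factory=dict)

    def add_edge(self, task, substitute, error):
        self.edges.setdefault(task, {})[substitute] = error

    def substitute_for(self, failed_task, alive):
        """Return (substitute, error) for the live neighbor with the
        smallest error, or None if no live neighbor exists."""
        candidates = [
            (error, sub)
            for sub, error in self.edges.get(failed_task, {}).items()
            if sub in alive
        ]
        if not candidates:
            return None
        error, sub = min(candidates)
        return sub, error


def resolve_failures(graph, tasks, failed):
    """Greedy assumption: each failed task is replaced independently and
    the aggregate error is the sum of the individual substitution errors."""
    alive = set(tasks) - set(failed)
    assignment = {}
    total_error = 0.0
    unrecovered = []
    for task in tasks:
        if task in alive:
            assignment[task] = task  # a live task computes its own result
            continue
        choice = graph.substitute_for(task, alive)
        if choice is None:
            unrecovered.append(task)  # no live substitute available
        else:
            sub, err = choice
            assignment[task] = sub
            total_error += err
    return assignment, total_error, unrecovered


# Illustrative data only: four tasks, two of which fail.
graph = SubstitutionGraph()
graph.add_edge("a", "b", 0.1)
graph.add_edge("a", "c", 0.3)
graph.add_edge("b", "a", 0.1)
graph.add_edge("c", "d", 0.2)

assignment, total_error, unrecovered = resolve_failures(
    graph, tasks=["a", "b", "c", "d"], failed={"a", "d"}
)
```

In this toy run, task `a` is replaced by its lowest-error live neighbor, while task `d` has no substitution edges and is reported as unrecovered. A real system would also have to account for the capacity and partitioning trade-offs the abstract mentions, which this sketch ignores.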