Reliability of task graph schedules with transient and fail-stop failures: complexity and algorithms

Authors:
Anne Benoit;Louis-Claude Canon;Emmanuel Jeannot;Yves Robert
Affiliations:
LIP, ENS Lyon, Lyon Cedex 07, France 69364 and Institut Universitaire de France, Paris, France;Nancy University, Nancy Cedex, France 54052 and INRIA, Le Chesnay Cedex, France;LaBRI, Talence Cedex, France 33405 and INRIA Bordeaux, Bordeaux Cedex, France;LIP, ENS Lyon, Lyon Cedex 07, France 69364 and Institut Universitaire de France, Paris, France
Venue:
Journal of Scheduling
Year:
2012

Abstract

This paper deals with the reliability of task graph schedules with transient and fail-stop failures. While computing the reliability of a given schedule is easy in the absence of task replication, the problem becomes much more difficult when task replication is used. We fill a complexity gap in the scheduling literature: our main result is that this reliability problem is #P'-Complete (hence at least as hard as NP-Complete problems), both for transient and for fail-stop processor failures. We also study the evaluation of a restricted class of schedules, where a task cannot be scheduled before all replicas of all its predecessors have completed their execution. Although the complexity in this case with fail-stop failures remains open, we provide an algorithm to estimate the reliability while limiting evaluation costs, and we validate this approach through simulations.
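
To illustrate why the case without replication is easy, here is a minimal sketch, not the authors' algorithm. It assumes transient failures, independent task executions, and an exponential failure law, none of which the abstract states; all function names and parameters are hypothetical. Without replication, the schedule succeeds only if every task succeeds, so the reliability is a simple product. The second function covers the restricted class of schedules under the same assumptions, where each task needs at least one of its independent replicas to succeed.

```python
import math


def task_success_probability(failure_rate, duration):
    # Assumption: exponential failure law, so a task of length `duration`
    # on a processor with rate `failure_rate` succeeds with exp(-rate * duration).
    return math.exp(-failure_rate * duration)


def reliability_without_replication(tasks):
    # tasks: list of (failure_rate, duration) pairs, one per task.
    # Every task must succeed, and failures are assumed independent.
    return math.prod(task_success_probability(r, d) for r, d in tasks)


def reliability_with_independent_replicas(tasks):
    # tasks: list of replica lists; each replica is (failure_rate, duration).
    # Assumption: a task succeeds if at least one replica succeeds, and
    # replica failures are independent (transient failures, restricted schedules).
    reliability = 1.0
    for replicas in tasks:
        all_replicas_fail = math.prod(
            1.0 - task_success_probability(r, d) for r, d in replicas
        )
        reliability *= 1.0 - all_replicas_fail
    return reliability
```

The product form breaks down for general replicated schedules, where a successor may start as soon as one replica of each predecessor completes, and for fail-stop failures, where a processor failure affects every later task on that processor. Those dependencies are what make the general problem hard.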