Fault-tolerant scheduling is essential for large-scale computational Grid systems, in which geographically distributed nodes often cooperate to execute a task. The primary-backup approach is a common fault-tolerance methodology in which each task has a primary and a backup on two different processors. In this paper, we address the problem of how to schedule DAGs in Grids with communication delays so that service failures can be avoided in the presence of processor faults. The challenge is that, because tasks in a DAG depend on each other, a task must be scheduled so that it will succeed even when any of its predecessors fails due to a processor failure. We first propose a communication model and determine when communications between a backup and the backups of its successors are necessary. We then determine when a backup can start and which processors are eligible to host it, so as to guarantee that every DAG can complete upon any processor failure. We develop two algorithms to schedule backups, which minimize response time and replication cost, respectively. We also develop a suboptimal algorithm that aims to minimize replication cost without affecting response time. We conduct extensive simulation experiments to quantify the performance of the proposed algorithms.
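To make the primary-backup constraint concrete, the following minimal Python sketch checks whether a given placement of primaries and backups leaves at least one copy of every task alive when one processor fails. It is an illustration only, not the scheduling algorithms proposed in the paper: the function names, the placement format, and the restriction to a single failed processor are assumptions made for this example, and the sketch ignores communication delays, start times, and task dependencies.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical placement format: task name -> (primary processor, backup processor).
Placement = Dict[str, Tuple[int, int]]


def surviving_copy(placement: Placement, failed: int) -> Dict[str, Optional[str]]:
    """For each task, report which copy can still run if processor `failed` fails.

    Returns "primary", "backup", or None when both copies sit on the failed processor.
    """
    result: Dict[str, Optional[str]] = {}
    for task, (primary, backup) in placement.items():
        if primary != failed:
            result[task] = "primary"
        elif backup != failed:
            result[task] = "backup"
        else:
            result[task] = None
    return result


def tolerates_single_failure(placement: Placement, processors: Iterable[int]) -> bool:
    """True if every task keeps at least one copy under any one processor failure."""
    return all(
        copy is not None
        for p in processors
        for copy in surviving_copy(placement, p).values()
    )


# Assumed example: three tasks placed on three processors.
placement = {"t1": (0, 1), "t2": (1, 2), "t3": (2, 0)}
print(surviving_copy(placement, failed=1))
print(tolerates_single_failure(placement, [0, 1, 2]))
```

Placing a primary and its backup on the same processor, for example `{"t1": (0, 0)}`, makes the check fail, which is why the approach requires two different processors per task.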