On the design of communication-aware fault-tolerant scheduling algorithms for precedence constrained tasks in grid computing systems with dedicated communication devices

Authors:
Qin Zheng;Bharadwaj Veeravalli
Affiliations:
Advanced Computing Programme, Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR), Singapore 138632, Singapore and Computer Networks and Distributed Syste ...;Advanced Computing Programme, Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR), Singapore 138632, Singapore and Computer Networks and Distributed Syste ...
Venue:
Journal of Parallel and Distributed Computing
Year:
2009

Abstract

Fault-tolerant scheduling is essential for large-scale computational Grid systems, in which geographically distributed nodes often cooperate to execute a task. The primary-backup approach is a common methodology for fault tolerance, wherein each task has a primary and a backup on two different processors. In this paper, we address the problem of how to schedule directed acyclic graphs (DAGs) in Grids with communication delays so that service failures can be avoided in the presence of processor faults. The challenge is that, because tasks in a DAG depend on each other, a task must be scheduled so that it will succeed when any of its predecessors fails due to a processor failure. We first propose a communication model and determine when communications between a backup and backups of its successors are necessary. We then determine when a backup can start and which processors are eligible for it, so as to guarantee that every DAG can complete upon any processor failure. We develop two algorithms to schedule backups, which minimize response time and replication cost, respectively. We also develop a suboptimal algorithm that targets minimizing replication cost without affecting response time. We conduct extensive simulation experiments to quantify the performance of the proposed algorithms.
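
The primary-backup placement rule described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it ignores timing, communication delays, and precedence constraints, and only checks that each task's primary and backup sit on different processors, so that one copy survives any single processor failure. The names `Placement`, `violates_primary_backup`, and `survivors`, and the processor labels, are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical representation (an assumption, not from the paper):
# task name -> (processor of the primary, processor of the backup)
Placement = Dict[str, Tuple[str, str]]


def violates_primary_backup(placement: Placement) -> List[str]:
    """Return the tasks whose primary and backup share a processor."""
    return sorted(t for t, (p, b) in placement.items() if p == b)


def survivors(placement: Placement, failed: str) -> Dict[str, str]:
    """Map each task to a processor holding a copy that survives the failure.

    The primary is preferred; the backup is used only if the primary
    was on the failed processor. Raises ValueError if both copies of
    some task were on the failed processor.
    """
    result: Dict[str, str] = {}
    for task, (primary, backup) in placement.items():
        if primary != failed:
            result[task] = primary
        elif backup != failed:
            result[task] = backup
        else:
            raise ValueError(f"task {task} lost both copies on {failed}")
    return result


# Example placement of three tasks on three processors (illustrative data).
placement: Placement = {
    "t1": ("P1", "P2"),
    "t2": ("P2", "P3"),
    "t3": ("P1", "P3"),
}
```

Calling `survivors(placement, "P1")` moves `t1` and `t3` to their backups while `t2` keeps its primary. A full scheduler in the spirit of the paper would additionally have to decide when each backup may start and which processors are eligible for it, given the DAG's dependencies.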