Contention awareness and fault-tolerant scheduling for precedence constrained tasks in heterogeneous systems

Authors:
Anne Benoit; Mourad Hakem; Yves Robert
Affiliations:
ENS Lyon, Université de Lyon, LIP Laboratory, UMR 5668, ENS Lyon - CNRS - INRIA - UCBL, Lyon, France;ENS Lyon, Université de Lyon, LIP Laboratory, UMR 5668, ENS Lyon - CNRS - INRIA - UCBL, Lyon, France;ENS Lyon, Université de Lyon, LIP Laboratory, UMR 5668, ENS Lyon - CNRS - INRIA - UCBL, Lyon, France
Venue:
Parallel Computing
Year:
2009

Abstract

Heterogeneous distributed systems are widely deployed to execute computationally intensive parallel applications with diverse computing needs. Such environments require effective scheduling strategies that account for both algorithmic and architectural characteristics. Unfortunately, most scheduling algorithms developed for such systems rely on a simple platform model that ignores communication contention. In addition, processors are generally assumed to be completely safe. To schedule precedence graphs in a more realistic framework, we first introduce an efficient fault-tolerant scheduling algorithm that is both contention-aware and capable of supporting an arbitrary number of fail-silent (fail-stop) processor failures. Next, we derive a more complex heuristic that departs from the main principle of the first algorithm. Instead of considering a single task (the one with the highest priority) and assigning all its replicas to the currently best available resources, we consider a chunk of ready tasks and assign all their replicas in the same decision-making procedure. This leads to a better load balance across processors and communication links. We focus on a bi-criteria approach, where we aim to minimize the total execution time, or latency, given a fixed number of failures supported in the system. Our algorithms have a low time complexity and drastically reduce the number of additional communications induced by the replication mechanism. Experimental results demonstrate the usefulness of the proposed algorithms, which lead to efficient execution schemes while guaranteeing a prescribed level of fault tolerance.
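
The replication idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm. It is a simplified list scheduler that assumes tasks arrive already sorted by priority in topological order, places `eps + 1` replicas of each task on distinct processors so that `eps` fail-stop failures can be tolerated, and picks the processors with the earliest finish times. It does not model link contention, and it pessimistically assumes each replica waits for the latest replica of every predecessor. The function name `replicate_schedule` and all parameter names are hypothetical.

```python
def replicate_schedule(tasks, preds, cost, comm, eps):
    """Place eps + 1 replicas of every task on distinct processors.

    tasks: task names in priority (topological) order   [assumed input]
    preds: task -> list of predecessor tasks
    cost:  task -> list of execution times, one per processor
    comm:  (u, v) -> delay of edge u->v across two different processors
    eps:   number of fail-stop processor failures to tolerate
    Returns task -> list of (processor, start, finish), one per replica.
    """
    n_procs = len(next(iter(cost.values())))
    if eps + 1 > n_procs:
        raise ValueError("need at least eps + 1 processors")

    proc_free = [0.0] * n_procs   # time at which each processor is idle
    placement = {}

    for t in tasks:
        candidates = []
        for p in range(n_procs):
            # Pessimistic data-ready time: for each predecessor, wait for
            # its latest replica (the only one guaranteed to survive).
            ready = 0.0
            for u in preds.get(t, []):
                arrival = max(
                    f + (0.0 if q == p else comm[(u, t)])
                    for q, _, f in placement[u]
                )
                ready = max(ready, arrival)
            start = max(ready, proc_free[p])
            candidates.append((start + cost[t][p], start, p))

        # Keep the eps + 1 processors with the earliest finish times;
        # each candidate is a different processor, so replicas never share one.
        candidates.sort()
        placement[t] = [(p, s, f) for f, s, p in candidates[: eps + 1]]
        for p, _, f in placement[t]:
            proc_free[p] = f

    return placement


# Toy instance (made-up numbers): edge a -> b, three processors, eps = 1.
example = replicate_schedule(
    tasks=["a", "b"],
    preds={"b": ["a"]},
    cost={"a": [2, 3, 4], "b": [3, 2, 5]},
    comm={("a", "b"): 1},
    eps=1,
)
latency_bound = max(f for reps in example.values() for _, _, f in reps)
```

In this sketch `latency_bound` is an upper bound on the latency under the pessimistic waiting rule. Every replica also receives data from every predecessor replica, which is exactly the kind of extra communication the abstract says the proposed algorithms reduce.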