Heterogeneous distributed systems are widely deployed for executing computationally intensive parallel applications with diverse computing needs. Such environments require effective scheduling strategies that take into account both algorithmic and architectural characteristics. Unfortunately, most scheduling algorithms developed for such systems rely on a simple platform model that ignores communication contention. In addition, processors are generally assumed never to fail. To schedule precedence graphs in a more realistic framework, we first introduce an efficient fault-tolerant scheduling algorithm that is both contention-aware and capable of supporting an arbitrary number of fail-silent (fail-stop) processor failures. Next, we derive a more complex heuristic that departs from the main principle of the first algorithm. Instead of considering a single task (the one with the highest priority) and assigning all its replicas to the currently best available resources, we consider a chunk of ready tasks and assign all their replicas in the same decision-making procedure. This leads to a better load balance across processors and communication links. We focus on a bi-criteria approach, where we aim at minimizing the total execution time, or latency, given a fixed number of failures supported in the system. Our algorithms have a low time complexity and drastically reduce the number of additional communications induced by the replication mechanism. Experimental results demonstrate the usefulness of the proposed algorithms, which lead to efficient execution schemes while guaranteeing a prescribed level of fault tolerance.
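The single-task principle described above (take the highest-priority task and place all its replicas on the currently best resources) can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes that tolerating `eps` fail-stop failures means running `eps + 1` replicas of every task on distinct processors, it ignores communication costs and link contention entirely, and all names (`schedule_with_replication`, `order`, `preds`, `cost`, `eps`) are hypothetical.

```python
from typing import Dict, List, Tuple

# (processor index, start time, finish time)
Replica = Tuple[int, float, float]


def schedule_with_replication(
    order: List[str],              # tasks by priority; must respect precedence
    preds: Dict[str, List[str]],   # predecessors of each task
    cost: Dict[str, List[float]],  # cost[t][p]: run time of task t on processor p
    num_procs: int,
    eps: int,                      # number of processor failures to tolerate
) -> Dict[str, List[Replica]]:
    """Greedy list scheduling with eps + 1 replicas per task (illustrative only)."""
    if eps + 1 > num_procs:
        raise ValueError("need at least eps + 1 processors")

    proc_free = [0.0] * num_procs  # time at which each processor becomes idle
    placement: Dict[str, List[Replica]] = {}

    for task in order:
        # Assumption: wait for the slowest replica of every predecessor, so the
        # start time stays valid whichever eps replicas are lost.
        data_ready = max(
            (finish for p in preds.get(task, []) for (_, _, finish) in placement[p]),
            default=0.0,
        )

        # Finish time of this task on every processor.
        candidates = []
        for proc in range(num_procs):
            start = max(proc_free[proc], data_ready)
            candidates.append((start + cost[task][proc], proc, start))

        # Keep the eps + 1 processors with the earliest finish times; each
        # processor appears once, so the replicas land on distinct processors.
        chosen = sorted(candidates)[: eps + 1]
        placement[task] = [(proc, start, finish) for finish, proc, start in chosen]
        for proc, _, finish in placement[task]:
            proc_free[proc] = finish

    return placement


def latency(placement: Dict[str, List[Replica]]) -> float:
    """Upper bound on completion time: latest finish over all replicas."""
    return max(finish for reps in placement.values() for (_, _, finish) in reps)
```

Calling `schedule_with_replication` with a topologically sorted `order` returns, for each task, the processors and time slots of its replicas, and `latency` gives the resulting makespan bound. The chunk-based heuristic from the abstract would instead decide the placement of several ready tasks together, which this sketch does not attempt.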