Safety and Reliability Driven Task Allocation in Distributed Systems

Authors: Santhanam Srinivasan; Niraj K. Jha
Affiliations: Lucent Bell Labs, Holmdel, NJ; Princeton Univ., Princeton, NJ
Venue: IEEE Transactions on Parallel and Distributed Systems
Year: 1999


Abstract

Distributed computer systems are increasingly being employed for critical applications, such as aircraft control, industrial process control, and banking systems. Maximizing performance has been the conventional objective in the allocation of tasks for such systems. Distributed systems are inherently more complex than centralized systems, and the added complexity could increase the potential for system failures. Some earlier work on allocating tasks to distributed systems treats reliability as the objective function to be maximized. Reliability is defined as the probability that none of the system components fails while processing. This, however, gives no guarantee about the behavior of the system when a failure occurs. A failure that is not detected immediately could lead to a catastrophe. Such systems are unsafe. In this paper, we describe a method to determine an allocation that introduces safety into a heterogeneous distributed system and at the same time attempts to maximize its reliability. First, we devise a new heuristic, based on the concept of clustering, to allocate tasks for maximizing reliability. We show that for task graphs with precedence constraints, our heuristic performs better than previously proposed heuristics. Next, by applying the concept of task-based fault tolerance, which we have previously proposed, we add extra assertion tasks to the system to make it safe. We present a new heuristic that does this in such a way that the decrease in reliability for the added safety is minimized. For the purpose of allocating the extra tasks, this heuristic performs as well as previously known methods and runs an order of magnitude faster. We present a number of simulation results to demonstrate the efficacy of our scheme.
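The abstract defines reliability as the probability that no system component fails while processing, but gives no formula. As a rough illustration only, the Python sketch below scores an allocation under an assumed constant-failure-rate (exponential) model, in which each processor or link contributes its failure rate multiplied by the time it is busy. The function name `allocation_reliability`, its parameters, and all numbers are hypothetical and are not taken from the paper; the clustering heuristic and the assertion-task heuristic are not reproduced here.

```python
import math


def allocation_reliability(allocation, exec_time, proc_failure_rate,
                           comm_time=None, link_failure_rate=None):
    """Probability that no processor or link fails while it is in use.

    ASSUMPTION (not from the paper): failures are independent with constant
    rates, so reliability = exp(-sum(rate * busy_time)).
    allocation:        task -> processor
    exec_time:         task -> {processor: execution time on that processor}
    proc_failure_rate: processor -> failure rate
    comm_time:         (task_a, task_b) -> communication time between the tasks
    link_failure_rate: frozenset({proc_a, proc_b}) -> failure rate of the link
    """
    exposure = 0.0
    # Each task exposes its host processor to failure for its execution time.
    for task, proc in allocation.items():
        exposure += proc_failure_rate[proc] * exec_time[task][proc]
    # Communicating tasks on different processors also expose the link.
    for (task_a, task_b), duration in (comm_time or {}).items():
        proc_a, proc_b = allocation[task_a], allocation[task_b]
        if proc_a != proc_b:
            link = frozenset((proc_a, proc_b))
            exposure += link_failure_rate[link] * duration
    return math.exp(-exposure)


# Hypothetical heterogeneous system: p2 is faster but less reliable than p1.
exec_time = {"t1": {"p1": 10.0, "p2": 5.0}, "t2": {"p1": 10.0, "p2": 5.0}}
proc_rate = {"p1": 0.001, "p2": 0.004}
comm_time = {("t1", "t2"): 2.0}
link_rate = {frozenset(("p1", "p2")): 0.01}

both_on_p1 = allocation_reliability({"t1": "p1", "t2": "p1"}, exec_time,
                                    proc_rate, comm_time, link_rate)
split = allocation_reliability({"t1": "p1", "t2": "p2"}, exec_time,
                               proc_rate, comm_time, link_rate)
print(f"both on p1: {both_on_p1:.4f}, split across p1/p2: {split:.4f}")
```

Under this assumed model, every extra task adds a non-negative term to the exposure, so adding a task can only lower the computed reliability or leave it unchanged. This matches the trade-off the abstract describes, where the added safety costs some reliability and the heuristic tries to keep that cost small.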