Efficient workload distribution is increasingly important for heterogeneous distributed systems in which the availability of compute nodes may change spontaneously over time. Resource-allocation policies designed for such systems should maximize performance while remaining robust to the failure and recovery of compute nodes. This work proposes such a policy, based on the concepts of the Derman-Lieberman-Ross theorem, and applies it to a simulated model of a dedicated system composed of a set of heterogeneous image-processing servers. Assuming that each image yields a "reward" if its processing is completed before a certain deadline, the goal of the resource-allocation policy is to maximize the expected cumulative reward. An extensive analysis studies the performance of the proposed policy and compares it with that of several existing policies adapted to this environment. Our experiments, conducted for various types of task-machine heterogeneity, illustrate the potential of our method for solving resource-allocation problems in a broad spectrum of distributed systems that experience high failure rates.
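For readers unfamiliar with the Derman-Lieberman-Ross theorem, the following is a minimal sketch of its classical setting only, not of the policy proposed in this work. It assumes task values drawn independently from Uniform(0, 1) and a fixed pool of workers ranked by a success value; the function names (`dlr_thresholds`, `pick_rank`) and the uniform distribution are illustrative assumptions. With n assignments remaining, the theorem partitions the value axis into n intervals, and an arriving task whose value falls in the i-th interval goes to the worker with the i-th smallest success value.

```python
import bisect
import math


def _clip01(x):
    """Clamp x to [0, 1]; also the CDF of Uniform(0, 1) (assumed distribution)."""
    return min(max(x, 0.0), 1.0)


def _partial_mean(lo, hi):
    """Integral of x dF(x) over (lo, hi] for Uniform(0, 1)."""
    lo, hi = _clip01(lo), _clip01(hi)
    return (hi * hi - lo * lo) / 2.0


def dlr_thresholds(n):
    """Return [a_0, ..., a_n] for n remaining assignments (a_0 = -inf, a_n = +inf).

    Classical recursion: the i-th threshold for n + 1 stages is the expected
    value assigned to the i-th smallest worker in the n-stage problem.
    """
    a = [-math.inf, math.inf]  # one stage left: a single interval
    for stage in range(1, n):
        nxt = [-math.inf]
        for i in range(1, stage + 1):
            lo, hi = a[i - 1], a[i]
            # Infinite endpoints contribute zero by convention.
            below = lo * _clip01(lo) if math.isfinite(lo) else 0.0
            above = hi * (1.0 - _clip01(hi)) if math.isfinite(hi) else 0.0
            nxt.append(_partial_mean(lo, hi) + below + above)
        nxt.append(math.inf)
        a = nxt
    return a


def pick_rank(value, thresholds):
    """1-based rank (by success value) of the worker that receives this task."""
    return bisect.bisect_left(thresholds, value)


if __name__ == "__main__":
    th = dlr_thresholds(3)
    print(th)                  # thresholds with three assignments remaining
    print(pick_rank(0.5, th))  # rank of the worker chosen for a task of value 0.5
```

The thresholds depend only on the value distribution and the number of remaining assignments, not on the workers' success values themselves, which is what makes the rule cheap to apply online. Handling node failure and recovery, deadlines, and heterogeneous servers, as described above, would require extensions that this sketch does not attempt.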