Enhancing performance of failure-prone clusters by adaptive provisioning of cloud resources

Authors:
Bahman Javadi; Parimala Thulasiraman; Rajkumar Buyya
Affiliations:
School of Computing, Engineering and Mathematics, University of Western Sydney, Sydney, Australia; InterDisciplinary Evolving Algorithmic Sciences (IDEAS) Laboratory, Department of Computer Science, University of Manitoba, Winnipeg, Canada; Cloud Computing and Distributed Systems (CLOUDS) Laboratory, Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
Venue:
The Journal of Supercomputing
Year:
2013

Abstract

In this paper, we investigate Cloud computing resource provisioning to extend the computing capacity of local clusters in the presence of failures. We consider three steps in resource provisioning: resource brokering, dispatch sequences, and scheduling. The proposed brokering strategy is based on the stochastic analysis of routing in distributed parallel queues and takes into account the response times of the Cloud provider and the local cluster, as well as the computing cost on both sides. Moreover, we propose dispatching with probabilistic and deterministic sequences to redirect requests to the resource providers. We also incorporate checkpointing into some well-known scheduling algorithms to provide a fault-tolerant environment. We propose two cost-aware and failure-aware provisioning policies that can be used by an organization that operates a cluster managed by virtual machine technology and seeks to use resources from a public Cloud provider. Simulation results demonstrate that the proposed policies improve the response time of users' requests by a factor of 4.10 under a moderate load, at a limited cost on a public Cloud.
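
The difference between probabilistic and deterministic dispatch sequences can be illustrated with a minimal Python sketch. It is not the paper's implementation: the routing fraction `p_cloud` is treated here as a given input (in the paper it would come from the brokering analysis), the function names are hypothetical, and the credit-accumulation rule is only one simple way to build a deterministic sequence with a fixed long-run fraction.

```python
import random
from typing import List


def probabilistic_sequence(p_cloud: float, n: int, seed: int = 0) -> List[str]:
    """Send each request to the cloud independently with probability p_cloud.

    p_cloud is an assumed input; this sketch does not derive it.
    """
    rng = random.Random(seed)
    return ["cloud" if rng.random() < p_cloud else "cluster" for _ in range(n)]


def deterministic_sequence(p_cloud: float, n: int) -> List[str]:
    """Build a fixed sequence whose long-run cloud fraction is p_cloud.

    Assumed rule (not taken from the paper): accumulate p_cloud as credit
    per request and dispatch to the cloud each time the credit reaches 1.
    """
    sequence = []
    credit = 0.0
    for _ in range(n):
        credit += p_cloud
        if credit >= 1.0:
            sequence.append("cloud")
            credit -= 1.0
        else:
            sequence.append("cluster")
    return sequence


if __name__ == "__main__":
    # Same target fraction, different request-to-request variability.
    print(probabilistic_sequence(0.25, 8))
    print(deterministic_sequence(0.25, 8))
```

Both functions aim at the same long-run share of requests sent to the Cloud. The probabilistic version can produce bursts toward one provider, while the deterministic version spaces the Cloud dispatches evenly.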