Flexible resource allocation for reliable virtual cluster computing systems

Authors:
Thomas J. Hacker;Kanak Mahadik
Affiliations:
Purdue University, West Lafayette, IN;Purdue University, West Lafayette, IN
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Abstract

Virtualization and cloud computing technologies now make it possible to create scalable and reliable virtual high performance computing clusters. Integrating these technologies, however, is complicated by fundamental differences in how these systems allocate resources to computational tasks. Cloud computing systems either allocate available resources immediately or deny the request. In contrast, parallel computing systems route all requests through a queue for future resource allocation. This divergence in allocation policies hinders efforts to implement efficient, responsive, and reliable virtual clusters. In this paper, we present a continuum of four scheduling policies, along with an analytical resource prediction model for each policy, to estimate the level of resources needed to operate an efficient, responsive, and reliable virtual cluster system. We show that it is possible to estimate the size of the virtual cluster system needed to provide a predictable grade of service for a realistic high performance computing workload, and to estimate the queue wait time for a partial or full resource allocation. Moreover, we show that it is possible to provide a reliable virtual cluster system using a limited pool of spare resources. The models and results we present are useful for cloud computing providers seeking to operate efficient and cost-effective virtual cluster systems.
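
The two allocation extremes described in the abstract (allocate immediately or deny, versus queue every request) correspond to the classical loss and delay systems of queueing theory. The sketch below is illustrative only and is not taken from the paper: it assumes Poisson arrivals and exponential service times, uses the standard Erlang B and Erlang C formulas, and the function names (`erlang_b`, `erlang_c`, `min_servers`) are hypothetical. It shows how a provider might estimate the number of nodes needed to meet a target grade of service under each extreme.

```python
from typing import Callable


def erlang_b(servers: int, offered_load: float) -> float:
    """Blocking probability of an M/M/c/c loss system (Erlang B).

    Models the "allocate now or deny" extreme. offered_load is the
    arrival rate times the mean service time, in Erlangs.
    Uses the numerically stable recurrence
    B(0) = 1, B(k) = a*B(k-1) / (k + a*B(k-1)).
    """
    blocking = 1.0
    for k in range(1, servers + 1):
        blocking = offered_load * blocking / (k + offered_load * blocking)
    return blocking


def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that a request must wait in an M/M/c queue (Erlang C).

    Models the "queue every request" extreme. Returns 1.0 when the
    system is overloaded (offered_load >= servers).
    """
    if offered_load >= servers:
        return 1.0
    blocking = erlang_b(servers, offered_load)
    utilization = offered_load / servers
    return blocking / (1.0 - utilization * (1.0 - blocking))


def min_servers(offered_load: float, target: float,
                formula: Callable[[int, float], float]) -> int:
    """Smallest server count whose blocking/waiting probability <= target."""
    servers = 1
    while formula(servers, offered_load) > target:
        servers += 1
    return servers


if __name__ == "__main__":
    load = 40.0      # assumed offered load in Erlangs (hypothetical value)
    target = 0.01    # assumed grade of service: at most 1% denied/delayed
    print("loss system nodes: ", min_servers(load, target, erlang_b))
    print("delay system nodes:", min_servers(load, target, erlang_c))
```

For the same offered load and target, the delay system needs at least as many nodes as the loss system, because the Erlang C waiting probability is never smaller than the Erlang B blocking probability. Intermediate policies, such as the partial allocations the abstract mentions, would need models beyond these two formulas.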