Resource Management for Elastic Cloud Workflows

Authors:
Li Yu;Douglas Thain
Venue:
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Year:
2012

Abstract

Cloud computing systems have joined campus and private grids as powerful and highly scalable environments for scientific computing. Furthermore, distributed applications are typically expressed in a form that allows them to run on an arbitrary number of nodes while tolerating failures and changes in available resources. This flexibility introduces problems relating to how many nodes an application can use and how those nodes should be allocated. In this paper, we explore these problems by presenting a general-purpose architecture for scalable cloud applications and describing its inherent resource management problems. We address these challenges by developing methods for runtime measurement of the number of nodes an application can use, for appropriately placing masters and workers, and for matching workers to masters. Finally, we propose a resource management mechanism that allows automatic resource allocation and flexible resource distribution. These techniques are presented in the context of our specific cloud architecture, but the lessons apply to any system where competing elastic applications must be right-sized to the available resources.