Cloud computing systems have joined campus and private grids as powerful and highly scalable environments for scientific computing. Distributed applications are typically expressed in a form that allows them to run on an arbitrary number of nodes while tolerating failures and changes in available resources. This flexibility raises two questions: how many nodes an application can use, and how those nodes should be allocated. In this paper, we explore these questions by presenting a general-purpose architecture for scalable cloud applications and describing the inherent resource management problems. We address these problems by developing methods for measuring at runtime the number of nodes an application can use, for appropriately placing masters and workers, and for matching workers to masters. Finally, we propose a resource management mechanism that allows automatic resource allocation and flexible resource distribution. These techniques are presented in the context of our specific cloud architecture, but the lessons apply to any system where competing elastic applications must be right-sized to the available resources.