Banking on decoupling: budget-driven sustainability for HPC applications on auction-based clouds

Authors:
Moussa Taifi
Affiliations:
Temple University, Philadelphia, PA
Venue:
ACM SIGOPS Operating Systems Review
Year:
2013

Abstract

Cloud providers auction their excess capacity as dynamically priced virtual instances. These spot instances offer significant savings over on-demand or fixed-price instances. Users of these resources provide a maximum bid price per hour, and the cloud provider runs the instances as long as the market price stays below the user's bid. Users of such resources are explicitly exposed to failures and must adapt their applications to provide some level of fault tolerance. In this paper, we examine the effect of bidding on virtual HPC clusters composed of spot instances. We describe the effect of uniform versus non-uniform bidding on both the failure rate and the failure model. We make an initial attempt at predicting the runtime of a parallel application under various bidding strategies and system parameters. We describe the relationship between bidding strategies and programming models, and we build a preliminary optimization model whose inputs are real price traces from Amazon Web Services and instrumented values for the processing and network capacities of cluster instances on EC2. Our results give preliminary insights into the relationship between non-uniform bidding and application scaling strategies.
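
The bidding rule described in the abstract can be illustrated with a minimal sketch. It is not the paper's model: the price trace and bid values are made up for illustration, the function names are hypothetical, and the sketch assumes an instance is terminated in the first hour the market price is no longer below its bid and is never restarted.

```python
from typing import List


def survival_hours(price_trace: List[float], bid: float) -> int:
    """Hours an instance runs before the market price stops being below its bid."""
    for hour, price in enumerate(price_trace):
        if price >= bid:  # assumed termination rule: price is no longer below the bid
            return hour
    return len(price_trace)


def cluster_sizes(price_trace: List[float], bids: List[float]) -> List[int]:
    """Number of instances still running at each hour (no restarts, an assumption)."""
    lifetimes = [survival_hours(price_trace, bid) for bid in bids]
    return [sum(1 for t in lifetimes if t > hour) for hour in range(len(price_trace))]


# Hypothetical hourly spot prices in dollars; not taken from real AWS traces.
trace = [0.10, 0.12, 0.30, 0.11, 0.50, 0.12]

uniform_bids = [0.25, 0.25, 0.25, 0.25]      # every node bids the same price
nonuniform_bids = [0.15, 0.25, 0.40, 0.60]   # each node bids a different price

print(cluster_sizes(trace, uniform_bids))     # [4, 4, 0, 0, 0, 0]
print(cluster_sizes(trace, nonuniform_bids))  # [4, 4, 2, 2, 1, 1]
```

In this toy trace, uniform bidding loses the whole cluster at once when the price crosses the shared bid, while non-uniform bidding loses nodes in stages. This is one way to picture how the bidding strategy can change the failure model as well as the failure rate.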