SpotMPI: a framework for auction-based HPC computing using Amazon spot instances

Authors:
Moussa Taifi;Justin Y. Shi;Abdallah Khreishah
Affiliations:
Temple University, Computer Science Department, Philadelphia, PA;Temple University, Computer Science Department, Philadelphia, PA;Temple University, Computer Science Department, Philadelphia, PA
Venue:
ICA3PP'11 Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part II
Year:
2011

Abstract

Economies of scale give cloud computing virtually unlimited, cost-effective processing potential. In theory, prices under fair market conditions should reflect the most reasonable cost of computation. Fairness is ensured by mutual agreement between sellers and buyers, and resource-use efficiency is automatically optimized in the process. While there is no lack of incentives for cloud providers to offer auction-based computing platforms, using these volatile platforms for practical computing is a challenge for existing programming paradigms. This paper reports a methodology and a toolkit designed to address these challenges for MPI applications. Unlike existing MPI fault tolerance tools, we emphasize dynamically adjusted optimal checkpoint-restart (CPR) intervals. We introduce a formal model and then an HPC application toolkit, named SpotMPI, to facilitate the practical execution of real MPI applications on volatile auction-based cloud platforms. Our models capture the intrinsic dependencies between critical time-consuming elements by leveraging instrumented performance parameters and publicly available resource bidding histories. We study algorithms with different computation vs. communication complexities. Our results offer non-trivial insights into optimal bidding and application scaling strategies.
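
The abstract does not spell out the paper's formal model. As a rough illustration only of how a checkpoint interval might be tied to bidding history, the sketch below uses Young's well-known first-order approximation, interval = sqrt(2 x checkpoint cost x mean time between failures). This is a stand-in, not SpotMPI's own model. The function names, the assumption that prices are sampled at a fixed period, and the sample price list are all hypothetical.

```python
import math


def mean_time_to_out_of_bid(prices, bid, sample_period_s):
    """Estimate mean in-bid run time per out-of-bid event (seconds).

    Assumption: `prices` is a spot price history sampled at a fixed
    period of `sample_period_s` seconds. An out-of-bid event is counted
    each time the price rises from at-or-below the bid to above it.
    """
    in_bid_samples = sum(1 for p in prices if p <= bid)
    failures = sum(
        1 for prev, cur in zip(prices, prices[1:]) if prev <= bid < cur
    )
    if failures == 0:
        return math.inf  # never outbid in this history window
    return in_bid_samples * sample_period_s / failures


def young_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the optimum checkpoint interval."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical hourly price samples (USD) and a hypothetical bid.
sample_prices = [0.10, 0.12, 0.30, 0.11, 0.10, 0.35, 0.10]
mtbf = mean_time_to_out_of_bid(sample_prices, bid=0.20, sample_period_s=3600)
interval = young_checkpoint_interval(checkpoint_cost_s=50.0, mtbf_s=mtbf)
print(f"estimated MTBF: {mtbf:.0f} s, checkpoint every {interval:.0f} s")
```

Under this simplification, a higher bid makes out-of-bid events rarer, which lengthens the estimated interval between failures and therefore the suggested checkpoint interval. Re-running the estimate as new price samples arrive is one simple way to picture a dynamically adjusted interval.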