Timing models and local stopping criteria for asynchronous iterative algorithms
Journal of Parallel and Distributed Computing
A first order approximation to the optimum checkpoint interval
Communications of the ACM
CoCheck: Checkpointing and Process Migration for MPI
IPPS '96 Proceedings of the 10th International Parallel Processing Symposium
FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World
Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Egida: An Extensible Toolkit For Low-Overhead Fault-Tolerance
FTCS '99 Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing
Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations
HPDC '99 Proceedings of the 8th IEEE International Symposium on High Performance Distributed Computing
Xen and the art of virtualization
SOSP '03 Proceedings of the nineteenth ACM symposium on Operating systems principles
A network-failure-tolerant message-passing system for terascale clusters
International Journal of Parallel Programming
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
CLUSTER '04 Proceedings of the 2004 IEEE International Conference on Cluster Computing
Evaluating the Performance Impact of Xen on MPI and Process Execution For HPC Systems
VTDC '06 Proceedings of the 2nd International Workshop on Virtualization Technology in Distributed Computing
MapReduce: simplified data processing on large clusters
Communications of the ACM - 50th anniversary issue: 1958 - 2008
Validity of the single processor approach to achieving large scale computing capabilities
AFIPS '67 (Spring) Proceedings of the April 18-20, 1967, spring joint computer conference
A higher order estimate of the optimum checkpoint interval for restart dumps
Future Generation Computer Systems
High-Performance Cloud Computing: A View of Scientific Applications
ISPAN '09 Proceedings of the 2009 10th International Symposium on Pervasive Systems, Algorithms, and Networks
Reducing Costs of Spot Instances via Checkpointing in the Amazon Elastic Compute Cloud
CLOUD '10 Proceedings of the 2010 IEEE 3rd International Conference on Cloud Computing
See spot run: using spot instances for mapreduce workflows
HotCloud'10 Proceedings of the 2nd USENIX conference on Hot topics in cloud computing
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Decision Model for Cloud Computing under SLA Constraints
MASCOTS '10 Proceedings of the 2010 IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems
Dynamic resource allocation for spot markets in clouds
Hot-ICE'11 Proceedings of the 11th USENIX conference on Hot topics in management of internet, cloud, and enterprise networks and services
Performance Analysis of Cloud Computing Services for Many-Tasks Scientific Computing
IEEE Transactions on Parallel and Distributed Systems
Coordinated checkpoint/restart process fault tolerance for mpi applications on hpc systems
Coordinated checkpoint/restart process fault tolerance for mpi applications on hpc systems
Sustainable GPU Computing at Scale
CSE '11 Proceedings of the 2011 14th IEEE International Conference on Computational Science and Engineering
Banking on decoupling: budget-driven sustainability for HPC applications on auction-based clouds
ACM SIGOPS Operating Systems Review
Hi-index | 0.00 |
The economy of scale offers cloud computing virtually unlimited cost effective processing potentials. Theoretically, prices under fair market conditions should reflect the most reasonable costs of computations. The fairness is ensured by the mutual agreements between the sellers and the buyers. Resource use efficiency is automatically optimized in the process. While there is no lack of incentives for the cloud provider to offer auction-based computing platform, using these volatile platform for practical computing is a challenge for existing programming paradigms. This paper reports a methodology and a toolkit designed to tame the challenges for MPI applications. Unlike existing MPI fault tolerance tools, we emphasize on dynamically adjusted optimal checkpoint-restart (CPR) intervals. We introduce a formal model, then a HPC application toolkit, named SpotMPI, to facilitate the practical execution of real MPI applications on volatile auction-based cloud platforms. Our models capture the intrinsic dependencies between critical time consuming elements by leveraging instrumented performance parameters and publicly available resource bidding histories. We study algorithms with different computing v.s. communication complexities. Our results show non-trivial insights into the optimal bidding and application scaling strategies.