BlobCR: efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots

Authors:
Bogdan Nicolae; Franck Cappello
Affiliations:
INRIA Saclay, Île-de-France, France; University of Illinois at Urbana-Champaign
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Abstract

Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running scientific applications. Given the dynamic nature of IaaS clouds and the long runtime and resource utilization of such applications, an efficient checkpoint-restart mechanism is essential in this context. This paper proposes a solution that aims to minimize both the storage space and the performance overhead of checkpoint-restart. We introduce an approach that leverages virtual machine (VM) disk-image multi-snapshotting and multi-deployment inside checkpoint-restart protocols running at guest level in order to efficiently capture, and if needed roll back, the complete state of the application, including file system modifications. Experiments on the Grid'5000 (G5K) testbed show substantial improvement for MPI applications over existing approaches, both when customized checkpointing is available at application level and when checkpointing must be handled at process level.
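
The abstract gives no implementation details, so the following is only a toy sketch of the general idea of disk-image snapshotting with rollback, not the authors' system. It assumes an in-memory block device in which each checkpoint stores only the blocks written since the previous one, and a restart discards everything written after the chosen snapshot. The class `SnapshotDisk` and its methods are hypothetical names introduced for illustration.

```python
from typing import Dict, List


class SnapshotDisk:
    """Toy block device with incremental snapshots (illustrative only)."""

    def __init__(self, num_blocks: int, block_size: int = 4) -> None:
        self.num_blocks = num_blocks
        self.block_size = block_size
        # Each snapshot holds only the blocks written since the previous one.
        self.snapshots: List[Dict[int, bytes]] = []
        # Blocks modified since the most recent snapshot.
        self.dirty: Dict[int, bytes] = {}

    def _check(self, block: int) -> None:
        if not 0 <= block < self.num_blocks:
            raise IndexError("block out of range")

    def write(self, block: int, data: bytes) -> None:
        self._check(block)
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        self.dirty[block] = data

    def read(self, block: int) -> bytes:
        self._check(block)
        if block in self.dirty:
            return self.dirty[block]
        # Search snapshots from newest to oldest for the latest version.
        for delta in reversed(self.snapshots):
            if block in delta:
                return delta[block]
        # Blocks that were never written read back as zeros.
        return bytes(self.block_size)

    def checkpoint(self) -> int:
        """Freeze the current modifications and return the snapshot id."""
        self.snapshots.append(self.dirty)
        self.dirty = {}
        return len(self.snapshots) - 1

    def restart(self, snapshot_id: int) -> None:
        """Roll the disk back to the state captured by snapshot_id."""
        if not 0 <= snapshot_id < len(self.snapshots):
            raise IndexError("unknown snapshot id")
        del self.snapshots[snapshot_id + 1:]
        self.dirty = {}


# Usage: changes made after the checkpoint vanish on restart.
disk = SnapshotDisk(num_blocks=8)
disk.write(0, b"ckp1")
first = disk.checkpoint()
disk.write(0, b"oops")
disk.write(1, b"temp")
disk.restart(first)
assert disk.read(0) == b"ckp1"
assert disk.read(1) == bytes(4)
```

Storing only the modified blocks per snapshot is what keeps the storage cost of repeated checkpoints proportional to the amount of changed data rather than to the full image size. A real system would persist these deltas to shared storage rather than keep them in memory.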