BlobCR: Virtual disk based checkpoint-restart for HPC applications on IaaS clouds

Authors:
Bogdan Nicolae; Franck Cappello
Affiliations:
IBM Research, Ireland; INRIA Saclay, France and University of Illinois at Urbana-Champaign, United States
Venue:
Journal of Parallel and Distributed Computing
Year:
2013

Abstract

Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running HPC applications. Because such deployments need fault tolerance, suspend-resume, and offline migration, an efficient checkpoint-restart mechanism becomes paramount in this context. We propose BlobCR, a dedicated checkpoint repository that takes live incremental snapshots of the whole disk attached to the virtual machine (VM) instances. BlobCR aims to minimize the performance overhead of checkpointing by persisting VM disk snapshots asynchronously in the background using a low-overhead technique we call selective copy-on-write. It supports both application-level and process-level checkpointing, and it can roll back file system changes. Experiments at large scale demonstrate the benefits of our proposal both in synthetic settings and for a real-life HPC application.
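
To make the idea of incremental snapshots persisted in the background more concrete, the following is a minimal sketch in Python. It is an illustration only, not the BlobCR implementation: the abstract does not specify how selective copy-on-write works, so the interpretation used here (preserve a chunk only if it belongs to a snapshot still being persisted and is overwritten before it has been flushed) is an assumption. The class `ChunkedDisk`, its methods, and the dictionary standing in for the checkpoint repository are all hypothetical names, and the background flush is simulated step by step rather than run in a separate thread.

```python
class ChunkedDisk:
    """Toy in-memory virtual disk split into fixed-size chunks (hypothetical)."""

    def __init__(self, num_chunks, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = [bytes(chunk_size) for _ in range(num_chunks)]
        self.dirty = set()    # chunks modified since the last snapshot request
        self.pending = {}     # chunk index -> preserved old content, or None
        self.copies_made = 0  # how many chunks had to be preserved

    def write(self, index, data):
        """Overwrite one whole chunk, preserving old content only if needed."""
        if len(data) != self.chunk_size:
            raise ValueError("data must be exactly one chunk long")
        # Assumed 'selective' rule: preserve the old content only when the
        # chunk is part of the in-flight snapshot and has not been flushed
        # or preserved yet. All other writes go straight through.
        if index in self.pending and self.pending[index] is None:
            # bytes objects are immutable, so keeping a reference to the old
            # content is enough to preserve it in this toy model.
            self.pending[index] = self.chunks[index]
            self.copies_made += 1
        self.chunks[index] = data
        self.dirty.add(index)

    def begin_snapshot(self):
        """Freeze the current dirty set as an incremental snapshot."""
        if self.pending:
            raise RuntimeError("a snapshot is still being persisted")
        self.pending = {index: None for index in self.dirty}
        self.dirty = set()

    def flush_one(self, repository):
        """Persist one pending chunk; return False when nothing is left."""
        if not self.pending:
            return False
        index = min(self.pending)
        preserved = self.pending.pop(index)
        # Use the preserved copy if the chunk was overwritten meanwhile,
        # otherwise the live content is still the snapshot content.
        repository[index] = preserved if preserved is not None else self.chunks[index]
        return True
```

In this sketch, an application would call `begin_snapshot()` at a checkpoint and keep writing while `flush_one()` is invoked repeatedly to drain the snapshot into the repository. Only the chunks modified since the previous snapshot are persisted, and only chunks overwritten before their flush incur an extra copy.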