The design and implementation of a log-structured file system
ACM Transactions on Computer Systems (TOCS)
Using MPI (2nd ed.): portable parallel programming with the message-passing interface
Using MPI (2nd ed.): portable parallel programming with the message-passing interface
A survey of rollback-recovery protocols in message-passing systems
ACM Computing Surveys (CSUR)
Message Logging: Pessimistic, Optimistic, Causal, and Optimal
IEEE Transactions on Software Engineering
Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Grid'5000: A Large Scale And Highly Reconfigurable Experimental Grid Testbed
International Journal of High Performance Computing Applications
Opening black boxes: using semantic information to combat virtual machine image sprawl
Proceedings of the fourth ACM SIGPLAN/SIGOPS international conference on Virtual execution environments
Parallax: virtual disks for virtual machines
Proceedings of the 3rd ACM SIGOPS/EuroSys European Conference on Computer Systems 2008
A quasi-synchronous checkpointing algorithm that prevents contention for stable storage
Information Sciences: an International Journal
Remus: high availability via asynchronous virtual machine replication
NSDI'08 Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation
Proceedings of the 6th ACM conference on Computing frontiers
PLFS: a checkpoint filesystem for parallel applications
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
Cassandra: a decentralized structured storage system
ACM SIGOPS Operating Systems Review
Characterizing cloud computing hardware reliability
Proceedings of the 1st ACM symposium on Cloud computing
A service composition framework for market-oriented high performance computing cloud
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Case study for running HPC applications in public clouds
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Scalable virtual machine storage using local disks
ACM SIGOPS Operating Systems Review
BlobSeer: Next-generation data management for large scale infrastructures
Journal of Parallel and Distributed Computing
Hybrid Checkpointing for MPI Jobs in HPC Environments
ICPADS '10 Proceedings of the 2010 IEEE 16th International Conference on Parallel and Distributed Systems
VirtCFT: A Transparent VM-Level Fault-Tolerant System for Virtual Clusters
ICPADS '10 Proceedings of the 2010 IEEE 16th International Conference on Parallel and Distributed Systems
Hybrid checkpointing using emerging nonvolatile memories for future exascale systems
ACM Transactions on Architecture and Code Optimization (TACO)
Magellan: experiences from a science cloud
Proceedings of the 2nd international workshop on Scientific cloud computing
Going back and forth: efficient multideployment and multisnapshotting on clouds
Proceedings of the 20th international symposium on High performance distributed computing
On the benefits of transparent compression for cost-effective cloud data storage
Transactions on large-scale data- and knowledge-centered systems III
Optimizing multi-deployment on clouds by means of self-adaptive prefetching
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Windows Azure Storage: a highly available cloud storage service with strong consistency
SOSP '11 Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles
Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications
IPDPS '11 Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium
FTI: high performance fault tolerance interface for hybrid systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Evaluating the viability of process replication reliability for exascale systems
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart
ICPP '11 Proceedings of the 2011 International Conference on Parallel Processing
Performance evaluation of Amazon EC2 for NASA HPC applications
Proceedings of the 3rd workshop on Scientific Cloud Computing Date
A hybrid local storage transfer scheme for live migration of I/O intensive workloads
Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Exploring the performance and mapping of HPC applications to platforms in the cloud
Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing
Scalable Reed-Solomon-based reliable local storage for HPC applications on iaas clouds
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Hi-index | 0.00 |
Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running HPC applications. Given the need to provide fault tolerance, support for suspend-resume and offline migration, an efficient Checkpoint-Restart mechanism becomes paramount in this context. We propose BlobCR, a dedicated checkpoint repository that is able to take live incremental snapshots of the whole disk attached to the virtual machine (VM) instances. BlobCR aims to minimize the performance overhead of checkpointing by persisting VM disk snapshots asynchronously in the background using a low overhead technique we call selective copy-on-write. It includes support for both application-level and process-level checkpointing, as well as support to roll back filesystem changes. Experiments at large scale demonstrate the benefits of our proposal both in synthetic settings and for a real-life HPC application.