Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running scientific applications. Given the dynamic nature of IaaS clouds and the long runtime and resource utilization of such applications, an efficient checkpoint-restart mechanism is essential in this context. This paper proposes a checkpoint-restart solution that aims to minimize both storage space and performance overhead. We introduce an approach that leverages virtual machine (VM) disk-image multi-snapshotting and multi-deployment inside checkpoint-restart protocols running at guest level, in order to efficiently capture and, if needed, roll back the complete state of the application, including file system modifications. Experiments on the Grid'5000 (G5K) testbed show substantial improvement for MPI applications over existing approaches, both when customized checkpointing is available at application level and when checkpointing must be handled at process level.
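To make the idea of coupling guest-level checkpoints with disk-image snapshots more concrete, the following is a minimal toy sketch in Python. It is not the paper's implementation: the `ToyDisk` class, the `checkpoint` and `restart` helpers, and the `STATE_BLOCK` constant are all hypothetical names introduced here for illustration, and the disk is modeled as an in-memory dictionary of blocks. The sketch assumes that the application state is first dumped to a file on the virtual disk and that the disk is then snapshotted incrementally, so a rollback restores both the process state and any file system modifications.

```python
from typing import Dict, List

STATE_BLOCK = 0  # hypothetical block where the guest dumps its process state


class ToyDisk:
    """Hypothetical block-level disk that keeps incremental snapshots."""

    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}           # live disk contents
        self.dirty: Dict[int, bytes] = {}            # blocks written since last snapshot
        self.snapshots: List[Dict[int, bytes]] = []  # one delta per snapshot

    def write(self, block_id: int, data: bytes) -> None:
        self.blocks[block_id] = data
        self.dirty[block_id] = data

    def snapshot(self) -> int:
        """Store only the blocks changed since the previous snapshot."""
        self.snapshots.append(self.dirty)
        self.dirty = {}
        return len(self.snapshots) - 1

    def rollback(self, snap_id: int) -> None:
        """Rebuild the disk by replaying the deltas 0..snap_id in order."""
        restored: Dict[int, bytes] = {}
        for delta in self.snapshots[: snap_id + 1]:
            restored.update(delta)
        self.blocks = restored
        self.dirty = {}
        del self.snapshots[snap_id + 1:]  # discard snapshots taken after snap_id


def checkpoint(disk: ToyDisk, app_state: bytes) -> int:
    """Guest-level step: dump the application state to disk, then snapshot the disk."""
    disk.write(STATE_BLOCK, app_state)
    return disk.snapshot()


def restart(disk: ToyDisk, snap_id: int) -> bytes:
    """Roll the disk back and return the application state saved at that checkpoint."""
    disk.rollback(snap_id)
    return disk.blocks[STATE_BLOCK]
```

In this toy model, each snapshot holds only the blocks written since the previous one, which is one simple way to keep storage overhead proportional to the amount of change rather than to the image size. Files written after a checkpoint disappear on restart, so the application resumes with a disk that is consistent with its saved state.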