Optimization of cloud task processing with checkpoint-restart mechanism

Authors:
Sheng Di; Yves Robert; Frédéric Vivien; Derrick Kondo; Cho-Li Wang; Franck Cappello
Affiliations:
Argonne National Laboratory and INRIA, Saclay, France; ENS Lyon and INRIA, France and University of Tennessee Knoxville; ENS Lyon and INRIA, France; INRIA, Grenoble, France; The University of Hong Kong, Hong Kong; Argonne National Laboratory
Venue:
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2013

Abstract

In this paper, we aim to optimize fault-tolerance techniques based on a checkpoint/restart mechanism in the context of cloud computing. Our contribution is threefold. (1) We derive a new formula to compute the optimal number of checkpoints for cloud jobs under varied distributions of failure events. Our analysis is generic, making no assumption about the failure probability distribution, and is simple to apply in practice. (2) We design an adaptive algorithm that optimizes the impact of checkpointing with respect to costs such as checkpointing and restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and the Berkeley Lab Checkpoint/Restart (BLCR) tool. Task failure events are emulated using a production trace from a large-scale Google data center. Experiments confirm that our solution is suitable for Google systems. Our optimized formula outperforms Young's formula by 3-10 percent, reducing wall-clock time by 50-100 seconds per job on average.
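
For context, the baseline named in the abstract, Young's formula, is the classical first-order approximation of the optimal checkpoint interval: the square root of twice the checkpoint cost times the mean time between failures. The sketch below illustrates only that baseline; the abstract does not state the paper's own formula, so it is not reproduced here. The function names, the sample values, and the rule used to turn an interval into a checkpoint count are assumptions made for illustration.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    checkpoint_cost: time needed to write one checkpoint (seconds).
    mtbf: mean time between failures (seconds).
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def checkpoints_for_task(task_length: float, interval: float) -> int:
    """Hypothetical helper: number of checkpoints taken when a task of the
    given length is split into equal segments of at most `interval` seconds.
    No checkpoint is taken after the final segment."""
    if task_length <= 0 or interval <= 0:
        raise ValueError("task_length and interval must be positive")
    segments = math.ceil(task_length / interval)
    return max(0, segments - 1)


if __name__ == "__main__":
    # Illustrative values only, not taken from the paper.
    interval = young_interval(checkpoint_cost=2.0, mtbf=100.0)
    count = checkpoints_for_task(task_length=100.0, interval=interval)
    print(interval, count)
```

Young's formula depends on the mean time between failures, whereas the abstract describes the paper's analysis as making no assumption about the failure probability distribution.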