In this paper, we aim to optimize fault-tolerance techniques based on a checkpoint/restart mechanism in the context of cloud computing. Our contribution is threefold. (1) We derive a new formula for the optimal number of checkpoints for cloud jobs under varied distributions of failure events. Our analysis is generic, making no assumption about the failure probability distribution, and is simple to apply in practice. (2) We design an adaptive algorithm that reduces the impact of checkpointing with respect to costs such as checkpointing and restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and the Berkeley Lab Checkpoint/Restart (BLCR) tool. Task failure events are emulated using a production trace collected from a large-scale Google data center. Experiments confirm that our solution is well suited to Google systems. Our optimized formula outperforms Young's formula by 3-10 percent, reducing wall-clock lengths by 50-100 seconds per job on average.
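For context, the baseline named above, Young's formula, is the classical first-order estimate of the optimum checkpoint interval, sqrt(2 * C * MTBF), where C is the cost of writing one checkpoint and MTBF is the mean time between failures. The sketch below illustrates that baseline only; the abstract does not state the paper's own formula, so it is not reproduced here. The function names, the `checkpoint_count` helper, and the sample values are assumptions chosen for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimum checkpoint interval: sqrt(2 * C * MTBF)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_count(job_length_s: float, interval_s: float) -> int:
    """Hypothetical helper: checkpoints taken when checkpointing every interval_s."""
    # No checkpoint is taken at the very end of the job, hence the "- 1".
    return max(0, math.ceil(job_length_s / interval_s) - 1)


# Illustrative values (assumed): 2 s per checkpoint, one failure per hour on
# average, and a 30-minute job.
interval = young_interval(checkpoint_cost_s=2.0, mtbf_s=3600.0)
count = checkpoint_count(job_length_s=1800.0, interval_s=interval)
print(f"interval = {interval:.1f} s, checkpoints = {count}")
```

With these assumed inputs the interval is 120 seconds, so the 30-minute job takes 14 checkpoints. Young's formula depends on the MTBF alone, which is the kind of distributional assumption the abstract says the new analysis avoids.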