Frequent resource failures are a major challenge for the rapid completion of batch jobs. Checkpointing and migration are one approach to accelerating job completion and avoiding deadlock. We study the problem of scheduling checkpoints of sequential jobs in the context of Desktop Grids, which consist of volunteered distributed resources. We craft a checkpoint scheduling algorithm that is provably optimal for discrete time when failures obey any general probability distribution. We show, using simulations with parameters based on real-world systems, that this optimal strategy scales and significantly outperforms other strategies in terms of checkpointing costs and batch completion times.
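To make the discrete-time setting concrete, the following is a minimal sketch of checkpoint planning by dynamic programming. It is not the algorithm from the paper, whose details the abstract does not give. It assumes a simplified objective (maximize the expected amount of work saved before the first failure), a fixed checkpoint cost, and a checkpoint after every chunk of work, including the last. The names `plan_checkpoints`, `work`, `cost`, and `survival` are hypothetical; `survival(t)` stands for the probability that the resource is still alive at time step `t`, and it may follow any distribution.

```python
from functools import lru_cache
from typing import Callable, List, Tuple


def plan_checkpoints(
    work: int, cost: int, survival: Callable[[int], float]
) -> Tuple[float, List[int]]:
    """Return (expected saved work, chunk sizes) for a job of `work` time units.

    Assumptions (illustrative, not taken from the paper):
    - time is discrete and each checkpoint takes `cost` time units;
    - a chunk counts as saved only if its checkpoint completes before
      the first failure, i.e. with probability survival(end_time);
    - every chunk, including the last, is followed by a checkpoint.
    """

    @lru_cache(maxsize=None)
    def best(done: int, elapsed: int) -> Tuple[float, Tuple[int, ...]]:
        # All work has been scheduled: nothing more to gain.
        if done == work:
            return 0.0, ()
        best_value, best_plan = -1.0, ()
        # Try every possible size for the next chunk of work.
        for chunk in range(1, work - done + 1):
            end = elapsed + chunk + cost  # time when its checkpoint completes
            tail_value, tail_plan = best(done + chunk, end)
            value = chunk * survival(end) + tail_value
            if value > best_value:
                best_value, best_plan = value, (chunk,) + tail_plan
        return best_value, best_plan

    value, plan = best(0, 0)
    return value, list(plan)


if __name__ == "__main__":
    # Hypothetical example: per-step survival probability of 0.95.
    expected, chunks = plan_checkpoints(20, 2, lambda t: 0.95 ** t)
    print(expected, chunks)
```

Because `survival` is an arbitrary function of elapsed time, the same sketch accepts non-memoryless failure behavior, for example a survival curve estimated from availability traces. The recursion depth grows with `work`, so this form is only suitable for small jobs.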